Matches in Nanopublications for { ?s ?p ?o <https://w3id.org/np/RAgewjyjDip-au9b3ZvDRaEnwXZQ_hUVZkFjOzI5mISSA/assertion>. }
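The pattern above restricts matches to the named assertion graph of the nanopublication. A minimal SPARQL sketch of the same lookup, assuming the quads are loaded in a store that exposes nanopublication graphs (which endpoint to run it against is an assumption; the graph IRI is taken from the pattern above):

```sparql
# List every triple asserted in this nanopublication's assertion graph.
# Run against any SPARQL endpoint holding the nanopublication quads
# (the choice of endpoint is an assumption, not part of the source).
SELECT ?s ?p ?o
WHERE {
  GRAPH <https://w3id.org/np/RAgewjyjDip-au9b3ZvDRaEnwXZQ_hUVZkFjOzI5mISSA/assertion> {
    ?s ?p ?o .
  }
}
```

The matched assertions are listed below.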
- 200 type Project assertion.
- 81beeb9e-e49d-4bbb-bea1-bf21e168b422 type TimeReference assertion.
- ee6fb1ee-e7de-42d7-82e0-ed5e33bb3c36 type ImageObject assertion.
- 0000-0003-2388-0744 type Agent assertion.
- dd5c3d62-b632-46a1-99e4-761f2e6cb60d mainEntity "workflow_hpAugmentation/Snakefile" assertion.
- dd5c3d62-b632-46a1-99e4-761f2e6cb60d importedBy 0000-0003-2388-0744 assertion.
- 13c69a83-de3f-4379-b137-6a12d45bf6e7 sdPublisher "https://about.workflowhub.eu/" assertion.
- 13c69a83-de3f-4379-b137-6a12d45bf6e7 image "reduced_workflow.png" assertion.
- dd5c3d62-b632-46a1-99e4-761f2e6cb60d creativeWorkStatus "Stable" assertion.
- 13c69a83-de3f-4379-b137-6a12d45bf6e7 producer "https://workflowhub.eu/projects/200" assertion.
- dd5c3d62-b632-46a1-99e4-761f2e6cb60d description "## Summary
HPPIDiscovery is a scientific workflow to augment, predict and perform an in silico curation of host-pathogen protein-protein interactions (PPIs), using graph theory to build new candidate PPIs and machine learning to predict and evaluate them, combining multiple PPI detection methods according to three categories: structural, based on the primary amino acid sequence, and functional annotations.

HPPIDiscovery contains three main steps: (i) acquisition of pathogen and host protein information from seed PPIs provided by HPIDB search methods, (ii) model training and generation of new candidate PPIs from the HPIDB seed proteins' partners, and (iii) evaluation of the new candidate PPIs and export of the results.

(i) The first step identifies the taxonomy IDs of the host and pathogen organisms in the result files. It then parses and cleans the HPIDB results and downloads the protein interactions of the found organisms from the STRING database. The STRING protein identifiers are also mapped using the ID mapping tool of the UniProt API, retrieving the UniProt entry IDs along with the functional annotations, sequence, domains and KEGG enzymes. (ii) The second step builds the training dataset, using the non-redundant HPIDB validated interactions of each genome as the positive set and random low-confidence STRING PPIs from each genome as the negative set. The PredPrin tool is then executed in training mode to obtain the model that will evaluate the new candidate PPIs. The new PPIs are generated by a pairwise combination of the STRING partners of the host and pathogen HPIDB proteins. Finally, (iii) in the third step, the PredPrin tool is used in test mode to evaluate the new PPIs and generate the reports and the list of positively predicted PPIs. The figure below illustrates the steps of this workflow.

## Requirements
* Edit the configuration file (config.yaml) according to your own data, filling out the following fields:
  - base_data: location of the directory of organism folders, e.g. /home/user/data/genomes
  - parameters_file: since this workflow may process multiple organisms in parallel, you must prepare a tab-separated file containing the genome folder names located in base_data, where the HPIDB files are located (e.g. /home/user/data/params.tsv). It must have the following columns: genome (folder name), hpidb_seed_network (the result exported by one of the search methods available in the HPIDB database), hpidb_search_method (the type of search used to generate the results) and target_taxon (the target taxon ID). The column hpidb_search_method may have two values: keyword or homology. In keyword mode, you provide a taxonomy, protein name, publication ID or detection method and save all results (mitab.zip) in the genome folder. Homology mode allows you to search for host-pathogen PPIs giving as input the FASTA sequences of a set of proteins of the target pathogen for enrichment (so you have to select the search for a pathogen set) and save the zipped results (interaction data) in the genome folder. This option is extremely useful when you are not sure that your organism has validated protein interactions, since it finds validated interactions from the closest proteins in the database. When using homology mode, the identifiers of the pathogen's query FASTA sequences must be UniProt IDs, and all the query protein IDs must belong to the same target organism (taxon ID).
  - model_file: path of a previously trained model in joblib format (if you want to train from the known validated PPIs given as seeds, just put a 'None' value)

## Usage Instructions
The steps below create an SQLite database file with all the task events, which can be used afterwards to retrieve the execution time taken by the tasks. It is also possible to run locally (see Luigi's documentation to change the running command).

* Preparation:
  1. `git clone https://github.com/YasCoMa/hppidiscovery.git`
  2. `cd hppidiscovery`
  3. `mkdir luigi_log`
  4. `luigid --background --logdir luigi_log` (start the Luigi server)
  5. `conda env create -f hp_ppi_augmentation.yml`
  6. `conda activate hp_ppi_augmentation`
     6.1. `pip3 install wget` (it is not installed in the environment)
  7. Run the `pwd` command and get the full path
  8. Substitute the full path obtained in the previous step into config_example.yaml
  9. Download the SPRINT pre-computed similarities from https://www.csd.uwo.ca/~ilie/SPRINT/precomputed_similarities.zip and unzip them inside workflow_hpAugmentation/predprin/core/sprint/HSP/
  10. `cd workflow_hpAugmentation/predprin/`
  11. Uncompress annotation_data.zip
  12. Uncompress sequence_data.zip
  13. `cd ../../`
  14. `cd workflow_hpAugmentation`
  15. `snakemake -n` (check the plan of jobs; it should return no errors or exceptions)
  16. `snakemake -j 4` (change this number according to the number of genomes to analyse and the number of cores available on your machine)" assertion.
- 13c69a83-de3f-4379-b137-6a12d45bf6e7 description "## Summary
HPPIDiscovery is a scientific workflow to augment, predict and perform an in silico curation of host-pathogen protein-protein interactions (PPIs), using graph theory to build new candidate PPIs and machine learning to predict and evaluate them, combining multiple PPI detection methods according to three categories: structural, based on the primary amino acid sequence, and functional annotations.

HPPIDiscovery contains three main steps: (i) acquisition of pathogen and host protein information from seed PPIs provided by HPIDB search methods, (ii) model training and generation of new candidate PPIs from the HPIDB seed proteins' partners, and (iii) evaluation of the new candidate PPIs and export of the results.

(i) The first step identifies the taxonomy IDs of the host and pathogen organisms in the result files. It then parses and cleans the HPIDB results and downloads the protein interactions of the found organisms from the STRING database. The STRING protein identifiers are also mapped using the ID mapping tool of the UniProt API, retrieving the UniProt entry IDs along with the functional annotations, sequence, domains and KEGG enzymes. (ii) The second step builds the training dataset, using the non-redundant HPIDB validated interactions of each genome as the positive set and random low-confidence STRING PPIs from each genome as the negative set. The PredPrin tool is then executed in training mode to obtain the model that will evaluate the new candidate PPIs. The new PPIs are generated by a pairwise combination of the STRING partners of the host and pathogen HPIDB proteins. Finally, (iii) in the third step, the PredPrin tool is used in test mode to evaluate the new PPIs and generate the reports and the list of positively predicted PPIs. The figure below illustrates the steps of this workflow.

## Requirements
* Edit the configuration file (config.yaml) according to your own data, filling out the following fields:
  - base_data: location of the directory of organism folders, e.g. /home/user/data/genomes
  - parameters_file: since this workflow may process multiple organisms in parallel, you must prepare a tab-separated file containing the genome folder names located in base_data, where the HPIDB files are located (e.g. /home/user/data/params.tsv). It must have the following columns: genome (folder name), hpidb_seed_network (the result exported by one of the search methods available in the HPIDB database), hpidb_search_method (the type of search used to generate the results) and target_taxon (the target taxon ID). The column hpidb_search_method may have two values: keyword or homology. In keyword mode, you provide a taxonomy, protein name, publication ID or detection method and save all results (mitab.zip) in the genome folder. Homology mode allows you to search for host-pathogen PPIs giving as input the FASTA sequences of a set of proteins of the target pathogen for enrichment (so you have to select the search for a pathogen set) and save the zipped results (interaction data) in the genome folder. This option is extremely useful when you are not sure that your organism has validated protein interactions, since it finds validated interactions from the closest proteins in the database. When using homology mode, the identifiers of the pathogen's query FASTA sequences must be UniProt IDs, and all the query protein IDs must belong to the same target organism (taxon ID).
  - model_file: path of a previously trained model in joblib format (if you want to train from the known validated PPIs given as seeds, just put a 'None' value)

## Usage Instructions
The steps below create an SQLite database file with all the task events, which can be used afterwards to retrieve the execution time taken by the tasks. It is also possible to run locally (see Luigi's documentation to change the running command).

* Preparation:
  1. `git clone https://github.com/YasCoMa/hppidiscovery.git`
  2. `cd hppidiscovery`
  3. `mkdir luigi_log`
  4. `luigid --background --logdir luigi_log` (start the Luigi server)
  5. `conda env create -f hp_ppi_augmentation.yml`
  6. `conda activate hp_ppi_augmentation`
     6.1. `pip3 install wget` (it is not installed in the environment)
  7. Run the `pwd` command and get the full path
  8. Substitute the full path obtained in the previous step into config_example.yaml
  9. Download the SPRINT pre-computed similarities from https://www.csd.uwo.ca/~ilie/SPRINT/precomputed_similarities.zip and unzip them inside workflow_hpAugmentation/predprin/core/sprint/HSP/
  10. `cd workflow_hpAugmentation/predprin/`
  11. Uncompress annotation_data.zip
  12. Uncompress sequence_data.zip
  13. `cd ../../`
  14. `cd workflow_hpAugmentation`
  15. `snakemake -n` (check the plan of jobs; it should return no errors or exceptions)
  16. `snakemake -j 4` (change this number according to the number of genomes to analyse and the number of cores available on your machine)" assertion.
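To make the configuration described above concrete, here is a minimal, hypothetical sketch of config.yaml and params.tsv; every path, folder name and taxon ID below is an illustrative placeholder, not a value taken from the crate:

```yaml
# config.yaml — hypothetical example; only the fields named in the
# description above are shown, and all values are placeholders.
base_data: /home/user/data/genomes           # directory containing one folder per genome
parameters_file: /home/user/data/params.tsv  # tab-separated table, one row per genome
model_file: None                             # or the path to a previously trained .joblib model
```

```tsv
genome	hpidb_seed_network	hpidb_search_method	target_taxon
genome_folder_1	mitab.zip	keyword	83332
```

In homology mode, the hpidb_seed_network entry would instead point to the zipped interaction-data export saved in the genome folder, as the description above notes.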
- 13c69a83-de3f-4379-b137-6a12d45bf6e7 version "1" assertion.
- Workflow-RO-Crate version "0.2.0" assertion.
- bf79d072-d5db-4eb9-b7fe-602b9bb7243c contentSize "0" assertion.
- 572775af-ed1f-4c29-8aa5-884686ab257d contentSize "24" assertion.
- a9f6e252-5e53-42a4-8a7f-b4a97b1c40cd contentSize "24" assertion.
- c16775dc-e7c3-486c-990c-3be2c5ea7a1a contentSize "24" assertion.
- 15c39d0d-0f43-4d64-8eba-6404356a5adf contentSize "101" assertion.
- 55ecaa21-84d9-42ec-b9bc-167cf49afa09 contentSize "110" assertion.
- c16364f9-cdaf-4f69-8b2d-24ec86e0760a contentSize "121" assertion.
- ed49a599-f8a8-4c12-a05d-3712467babdf contentSize "162" assertion.
- 9456290e-07e1-4286-b1f0-1dd722a7b653 contentSize "172" assertion.
- 7776b24b-a086-4195-9f8f-056836c6a0a0 contentSize "236" assertion.
- 24f02bf4-0975-45cf-9f5a-ec39be7e52c5 contentSize "239" assertion.
- 12c2cb86-f0cb-4af4-9573-fe75e143f7e8 contentSize "371" assertion.
- 7d719c89-cfba-4eca-8341-ac6daa6b370f contentSize "400" assertion.
- 71bd00c7-2443-4a7c-b15f-63bab9d9e1aa contentSize "488" assertion.
- 06814c66-d6f5-4d56-8024-1c8d2e58a4db contentSize "1055" assertion.
- 21dfa77e-a9f7-4ad4-badb-f8dcc1bcc931 contentSize "1363" assertion.
- aef89e9f-9d50-4fde-8228-740059cb5c16 contentSize "2238" assertion.
- 6a7dbfa6-fc66-449f-bbe5-2b4d83cbabb3 contentSize "2496" assertion.
- 04bdbfbd-b95b-4ec8-9b2b-36bfc698c52b contentSize "3220" assertion.
- 6f6e8ba4-ce00-45e0-9063-30e54af32626 contentSize "5541" assertion.
- b2624fab-49ca-4a8d-aaa5-496ea8a622b2 contentSize "5834" assertion.
- 8b8059d8-8685-4549-b8ba-45c66abbd09d contentSize "7351" assertion.
- b1c57e68-6255-41b8-b621-555a2ae094ae contentSize "7912" assertion.
- 87226596-ccc0-45e0-a273-9bf48505f929 contentSize "8116" assertion.
- 6b5ce189-93d3-4817-8879-57c83efcadea contentSize "10280" assertion.
- 87333d63-a279-4521-b93e-ba87ad7af3ba contentSize "11043" assertion.
- c0ff9b95-c133-46a2-8b43-7e9adda5a910 contentSize "13756" assertion.
- 07e9fdac-98dc-4a1c-9785-d8d311108cde contentSize "13959" assertion.
- 51b0f2f4-234c-4654-99b4-17c3860a5d44 contentSize "20900" assertion.
- 1a309c23-b22b-4384-acd2-b47eace15095 contentSize "21508" assertion.
- f02fd4e5-6869-4afc-a651-5e912e99b841 contentSize "22743" assertion.
- a056477a-bbc1-45d0-9ba4-3b8c289a1bb6 contentSize "30231" assertion.
- b0c9b02f-83cb-410f-8028-cfd13491f0d1 contentSize "30231" assertion.
- 9f594136-23cb-4514-a3b0-7ca5193003cf contentSize "54971" assertion.
- 3b6d6431-467c-4cb3-8a07-67a1c561d23c contentSize "83820" assertion.
- 7e5af759-e546-4f73-b221-18bdb12c38b0 contentSize "1062372" assertion.
- 24313793-4361-4407-aa07-4df8f0cee2f5 contentSize "1064201" assertion.
- c2131dd6-62cc-4d44-8083-704edbc6a3ce contentSize "3311534" assertion.
- ebc55bb6-993d-4f70-b995-17a4664bb0e9 contentSize "6489975" assertion.
- 074153e1-485d-47c3-8fd1-9b872f4d9bd3 contentSize "11917056" assertion.
- abc766e7-ba36-456b-87de-2f17ee5ae845 contentSize "60418600" assertion.
- dd5c3d62-b632-46a1-99e4-761f2e6cb60d contentSize "73915655" assertion.
- 13c69a83-de3f-4379-b137-6a12d45bf6e7 contentSize "10251" assertion.
- ee6fb1ee-e7de-42d7-82e0-ed5e33bb3c36 contentSize "54971" assertion.
- 13c69a83-de3f-4379-b137-6a12d45bf6e7 dateCreated "2023-10-19 23:56:34+00:00" assertion.
- dd5c3d62-b632-46a1-99e4-761f2e6cb60d dateCreated "2023-10-20 08:40:49.625147+00:00" assertion.
- ed49a599-f8a8-4c12-a05d-3712467babdf dateCreated "2023-10-20 08:40:50.855228+00:00" assertion.
- 7d719c89-cfba-4eca-8341-ac6daa6b370f dateCreated "2023-10-20 08:40:50.856652+00:00" assertion.
- 04bdbfbd-b95b-4ec8-9b2b-36bfc698c52b dateCreated "2023-10-20 08:40:50.857721+00:00" assertion.
- aef89e9f-9d50-4fde-8228-740059cb5c16 dateCreated "2023-10-20 08:40:50.858823+00:00" assertion.
- 06814c66-d6f5-4d56-8024-1c8d2e58a4db dateCreated "2023-10-20 08:40:50.859933+00:00" assertion.
- a056477a-bbc1-45d0-9ba4-3b8c289a1bb6 dateCreated "2023-10-20 08:40:50.862139+00:00" assertion.
- 24313793-4361-4407-aa07-4df8f0cee2f5 dateCreated "2023-10-20 08:40:50.872283+00:00" assertion.
- 7e5af759-e546-4f73-b221-18bdb12c38b0 dateCreated "2023-10-20 08:40:50.884026+00:00" assertion.
- 074153e1-485d-47c3-8fd1-9b872f4d9bd3 dateCreated "2023-10-20 08:40:50.969541+00:00" assertion.
- b0c9b02f-83cb-410f-8028-cfd13491f0d1 dateCreated "2023-10-20 08:40:50.971316+00:00" assertion.
- b2624fab-49ca-4a8d-aaa5-496ea8a622b2 dateCreated "2023-10-20 08:40:50.972235+00:00" assertion.
- ee6fb1ee-e7de-42d7-82e0-ed5e33bb3c36 dateCreated "2023-10-20 08:40:50.973456+00:00" assertion.
- 21dfa77e-a9f7-4ad4-badb-f8dcc1bcc931 dateCreated "2023-10-20 08:40:50.975444+00:00" assertion.
- a9f6e252-5e53-42a4-8a7f-b4a97b1c40cd dateCreated "2023-10-20 08:40:50.976290+00:00" assertion.
- 572775af-ed1f-4c29-8aa5-884686ab257d dateCreated "2023-10-20 08:40:50.977115+00:00" assertion.
- c16775dc-e7c3-486c-990c-3be2c5ea7a1a dateCreated "2023-10-20 08:40:50.977925+00:00" assertion.
- c0ff9b95-c133-46a2-8b43-7e9adda5a910 dateCreated "2023-10-20 08:40:50.980132+00:00" assertion.
- 8b8059d8-8685-4549-b8ba-45c66abbd09d dateCreated "2023-10-20 08:40:50.981025+00:00" assertion.
- b1c57e68-6255-41b8-b621-555a2ae094ae dateCreated "2023-10-20 08:40:50.981927+00:00" assertion.
- 55ecaa21-84d9-42ec-b9bc-167cf49afa09 dateCreated "2023-10-20 08:40:50.982729+00:00" assertion.
- 24f02bf4-0975-45cf-9f5a-ec39be7e52c5 dateCreated "2023-10-20 08:40:50.983565+00:00" assertion.
- 9456290e-07e1-4286-b1f0-1dd722a7b653 dateCreated "2023-10-20 08:40:50.984377+00:00" assertion.
- c16364f9-cdaf-4f69-8b2d-24ec86e0760a dateCreated "2023-10-20 08:40:50.985206+00:00" assertion.
- 71bd00c7-2443-4a7c-b15f-63bab9d9e1aa dateCreated "2023-10-20 08:40:50.986013+00:00" assertion.
- 07e9fdac-98dc-4a1c-9785-d8d311108cde dateCreated "2023-10-20 08:40:50.986923+00:00" assertion.
- f02fd4e5-6869-4afc-a651-5e912e99b841 dateCreated "2023-10-20 08:40:50.987866+00:00" assertion.
- 51b0f2f4-234c-4654-99b4-17c3860a5d44 dateCreated "2023-10-20 08:40:50.988797+00:00" assertion.
- 87333d63-a279-4521-b93e-ba87ad7af3ba dateCreated "2023-10-20 08:40:50.989648+00:00" assertion.
- 12c2cb86-f0cb-4af4-9573-fe75e143f7e8 dateCreated "2023-10-20 08:40:50.990475+00:00" assertion.
- 87226596-ccc0-45e0-a273-9bf48505f929 dateCreated "2023-10-20 08:40:50.991348+00:00" assertion.
- ebc55bb6-993d-4f70-b995-17a4664bb0e9 dateCreated "2023-10-20 08:40:51.016019+00:00" assertion.
- c2131dd6-62cc-4d44-8083-704edbc6a3ce dateCreated "2023-10-20 08:40:51.044984+00:00" assertion.
- 7776b24b-a086-4195-9f8f-056836c6a0a0 dateCreated "2023-10-20 08:40:51.045900+00:00" assertion.
- 15c39d0d-0f43-4d64-8eba-6404356a5adf dateCreated "2023-10-20 08:40:51.046715+00:00" assertion.
- 6f6e8ba4-ce00-45e0-9063-30e54af32626 dateCreated "2023-10-20 08:40:51.047714+00:00" assertion.
- 9f594136-23cb-4514-a3b0-7ca5193003cf dateCreated "2023-10-20 08:40:51.048967+00:00" assertion.
- abc766e7-ba36-456b-87de-2f17ee5ae845 dateCreated "2023-10-20 08:40:51.460476+00:00" assertion.
- 6a7dbfa6-fc66-449f-bbe5-2b4d83cbabb3 dateCreated "2023-10-20 08:40:51.461457+00:00" assertion.
- 6b5ce189-93d3-4817-8879-57c83efcadea dateCreated "2023-10-20 08:40:51.462174+00:00" assertion.