Matches in Nanopublications for { ?s ?p ?o <https://w3id.org/np/RAZtLdcUMAv18L-0CpO6ZihqA2gMmX_QD6n3DNIyLsdwk/assertion>. }
- f2e728b8-5c9c-4555-9232-0131092b6272 type Domain assertion.
- f3327c85-d7a6-4af9-b92b-38a8edd6d3c4 type Lemma assertion.
- f494add2-e4bb-40b1-9113-cc071ade8d77 type Lemma assertion.
- f8050931-c4dc-48d0-85d7-489292466a09 type Lemma assertion.
- dd5c3d62-b632-46a1-99e4-761f2e6cb60d mainEntity "workflow_hpAugmentation/Snakefile" assertion.
- dd5c3d62-b632-46a1-99e4-761f2e6cb60d importedBy 0000-0003-2388-0744 assertion.
- 13c69a83-de3f-4379-b137-6a12d45bf6e7 sdPublisher "https://about.workflowhub.eu/" assertion.
- 13c69a83-de3f-4379-b137-6a12d45bf6e7 image "reduced_workflow.png" assertion.
- dd5c3d62-b632-46a1-99e4-761f2e6cb60d creativeWorkStatus "Stable" assertion.
- 13c69a83-de3f-4379-b137-6a12d45bf6e7 producer "https://workflowhub.eu/projects/200" assertion.
- dd5c3d62-b632-46a1-99e4-761f2e6cb60d description dd5c3d62-b632-46a1-99e4-761f2e6cb60d assertion.
- dd5c3d62-b632-46a1-99e4-761f2e6cb60d description "## Summary HPPIDiscovery is a scientific workflow to augment, predict and perform an in silico curation of host-pathogen Protein-Protein Interactions (PPIs). It uses graph theory to build new candidate PPIs and machine learning to predict and evaluate them, combining multiple PPI detection methods in three categories: structural, based on primary amino acid sequence, and based on functional annotations.<br> HPPIDiscovery contains three main steps: (i) acquisition of pathogen and host protein information from seed PPIs provided by HPIDB search methods, (ii) model training and generation of new candidate PPIs from the partners of the HPIDB seed proteins, and (iii) evaluation of the new candidate PPIs and export of results. (i) The first step identifies the taxonomy IDs of the host and pathogen organisms in the result files. It then parses and cleans the HPIDB results and downloads the protein interactions of the organisms found from the STRING database. The STRING protein identifiers are also mapped using the ID mapping tool of the UniProt API to retrieve the UniProt entry IDs along with the functional annotations, sequence, domain and KEGG enzymes. (ii) The second step builds the training dataset using the non-redundant HPIDB validated interactions of each genome as the positive set and random low-confidence STRING PPIs from each genome as the negative set. Then, the PredPrin tool is executed in training mode to obtain the model that will evaluate the new candidate PPIs. The new PPIs are then generated by performing a pairwise combination of the STRING partners of the host and pathogen HPIDB proteins. Finally, (iii) in the third step, the PredPrin tool is used in test mode to evaluate the new PPIs and generate the reports and the list of positively predicted PPIs. The figure below illustrates the steps of this workflow.
## Requirements: * Edit the configuration file (config.yaml) according to your own data, filling out the following fields: - base_data: location of the directory containing the organism folders, for example: /home/user/data/genomes - parameters_file: Since this workflow may process multiple organisms in parallel, you must prepare a tab-separated file listing the genome folder names located in base_data, where the HPIDB files are located. Example: /home/user/data/params.tsv. It must have the following columns: genome (folder name), hpidb_seed_network (the result exported by one of the search methods available in the HPIDB database), hpidb_search_method (the type of search used to generate the results) and target_taxon (the target taxon id). The column hpidb_source may have two values: keyword or homology. In the keyword mode, you provide a taxonomy, protein name, publication id or detection method and you save all results (mitab.zip) in the genome folder. Finally, the homology mode allows the user to search for host-pathogen PPIs by giving as input the FASTA sequences of a set of proteins of the target pathogen for enrichment (so you have to select the search for a pathogen set); you then save the zipped results folder (interaction data) in the genome folder. This option is extremely useful when you are not sure that your organism has validated protein interactions, since it finds validated interactions from the closest proteins in the database. When using the homology mode, the identifiers of the pathogen's query FASTA sequences must be UniProt IDs. All the query protein IDs must belong to the same target organism (taxon id). - model_file: path of a previously trained model in joblib format (if you want to train from the known validated PPIs given as seeds, just put a 'None' value) ## Usage Instructions The steps below consider the creation of a SQLite database file with all the task events, which can be used afterwards to retrieve the execution time taken by the tasks.
It is possible to run locally too (see luigi's documentation to change the running command). <br><br> * Preparation: 1. ````git clone https://github.com/YasCoMa/hppidiscovery.git```` 2. ````cd hppidiscovery```` 3. ````mkdir luigi_log```` 4. ````luigid --background --logdir luigi_log```` (start luigi server) 5. conda env create -f hp_ppi_augmentation.yml 6. conda activate hp_ppi_augmentation 6.1. (execute ````pip3 install wget```` (it is not installed in the environment)) 7. run the ````pwd```` command and get the full path 8. In config_example.yaml, substitute the full path obtained in the previous step 9. Download the SPRINT pre-computed similarities from https://www.csd.uwo.ca/~ilie/SPRINT/precomputed_similarities.zip and unzip them inside workflow_hpAugmentation/predprin/core/sprint/HSP/ 10. ````cd workflow_hpAugmentation/predprin/```` 11. Uncompress annotation_data.zip 12. Uncompress sequence_data.zip 13. ````cd ../../```` 14. ````cd workflow_hpAugmentation```` 15. snakemake -n (check the plan of jobs; it should return no errors or exceptions) 16. snakemake -j 4 (change this number according to the number of genomes to analyse and the number of cores available on your machine)" assertion.
- 13c69a83-de3f-4379-b137-6a12d45bf6e7 description "## Summary HPPIDiscovery is a scientific workflow to augment, predict and perform an in silico curation of host-pathogen Protein-Protein Interactions (PPIs). It uses graph theory to build new candidate PPIs and machine learning to predict and evaluate them, combining multiple PPI detection methods in three categories: structural, based on primary amino acid sequence, and based on functional annotations.<br> HPPIDiscovery contains three main steps: (i) acquisition of pathogen and host protein information from seed PPIs provided by HPIDB search methods, (ii) model training and generation of new candidate PPIs from the partners of the HPIDB seed proteins, and (iii) evaluation of the new candidate PPIs and export of results. (i) The first step identifies the taxonomy IDs of the host and pathogen organisms in the result files. It then parses and cleans the HPIDB results and downloads the protein interactions of the organisms found from the STRING database. The STRING protein identifiers are also mapped using the ID mapping tool of the UniProt API to retrieve the UniProt entry IDs along with the functional annotations, sequence, domain and KEGG enzymes. (ii) The second step builds the training dataset using the non-redundant HPIDB validated interactions of each genome as the positive set and random low-confidence STRING PPIs from each genome as the negative set. Then, the PredPrin tool is executed in training mode to obtain the model that will evaluate the new candidate PPIs. The new PPIs are then generated by performing a pairwise combination of the STRING partners of the host and pathogen HPIDB proteins. Finally, (iii) in the third step, the PredPrin tool is used in test mode to evaluate the new PPIs and generate the reports and the list of positively predicted PPIs. The figure below illustrates the steps of this workflow.
## Requirements: * Edit the configuration file (config.yaml) according to your own data, filling out the following fields: - base_data: location of the directory containing the organism folders, for example: /home/user/data/genomes - parameters_file: Since this workflow may process multiple organisms in parallel, you must prepare a tab-separated file listing the genome folder names located in base_data, where the HPIDB files are located. Example: /home/user/data/params.tsv. It must have the following columns: genome (folder name), hpidb_seed_network (the result exported by one of the search methods available in the HPIDB database), hpidb_search_method (the type of search used to generate the results) and target_taxon (the target taxon id). The column hpidb_source may have two values: keyword or homology. In the keyword mode, you provide a taxonomy, protein name, publication id or detection method and you save all results (mitab.zip) in the genome folder. Finally, the homology mode allows the user to search for host-pathogen PPIs by giving as input the FASTA sequences of a set of proteins of the target pathogen for enrichment (so you have to select the search for a pathogen set); you then save the zipped results folder (interaction data) in the genome folder. This option is extremely useful when you are not sure that your organism has validated protein interactions, since it finds validated interactions from the closest proteins in the database. When using the homology mode, the identifiers of the pathogen's query FASTA sequences must be UniProt IDs. All the query protein IDs must belong to the same target organism (taxon id). - model_file: path of a previously trained model in joblib format (if you want to train from the known validated PPIs given as seeds, just put a 'None' value) ## Usage Instructions The steps below consider the creation of a SQLite database file with all the task events, which can be used afterwards to retrieve the execution time taken by the tasks.
It is possible to run locally too (see luigi's documentation to change the running command). <br><br> * Preparation: 1. ````git clone https://github.com/YasCoMa/hppidiscovery.git```` 2. ````cd hppidiscovery```` 3. ````mkdir luigi_log```` 4. ````luigid --background --logdir luigi_log```` (start luigi server) 5. conda env create -f hp_ppi_augmentation.yml 6. conda activate hp_ppi_augmentation 6.1. (execute ````pip3 install wget```` (it is not installed in the environment)) 7. run the ````pwd```` command and get the full path 8. In config_example.yaml, substitute the full path obtained in the previous step 9. Download the SPRINT pre-computed similarities from https://www.csd.uwo.ca/~ilie/SPRINT/precomputed_similarities.zip and unzip them inside workflow_hpAugmentation/predprin/core/sprint/HSP/ 10. ````cd workflow_hpAugmentation/predprin/```` 11. Uncompress annotation_data.zip 12. Uncompress sequence_data.zip 13. ````cd ../../```` 14. ````cd workflow_hpAugmentation```` 15. snakemake -n (check the plan of jobs; it should return no errors or exceptions) 16. snakemake -j 4 (change this number according to the number of genomes to analyse and the number of cores available on your machine)" assertion.
- Workflow-RO-Crate version "0.2.0" assertion.
- 13c69a83-de3f-4379-b137-6a12d45bf6e7 version "1" assertion.
- dd5c3d62-b632-46a1-99e4-761f2e6cb60d contentSize "73915655" assertion.
- 04bdbfbd-b95b-4ec8-9b2b-36bfc698c52b contentSize "3220" assertion.
- 06814c66-d6f5-4d56-8024-1c8d2e58a4db contentSize "1055" assertion.
- 074153e1-485d-47c3-8fd1-9b872f4d9bd3 contentSize "11917056" assertion.
- 07e9fdac-98dc-4a1c-9785-d8d311108cde contentSize "13959" assertion.
- 12c2cb86-f0cb-4af4-9573-fe75e143f7e8 contentSize "371" assertion.
- 13c69a83-de3f-4379-b137-6a12d45bf6e7 contentSize "10251" assertion.
- 15c39d0d-0f43-4d64-8eba-6404356a5adf contentSize "101" assertion.
- 1a309c23-b22b-4384-acd2-b47eace15095 contentSize "21508" assertion.
- 21dfa77e-a9f7-4ad4-badb-f8dcc1bcc931 contentSize "1363" assertion.
- 24313793-4361-4407-aa07-4df8f0cee2f5 contentSize "1064201" assertion.
- 24f02bf4-0975-45cf-9f5a-ec39be7e52c5 contentSize "239" assertion.
- 3b6d6431-467c-4cb3-8a07-67a1c561d23c contentSize "83820" assertion.
- 51b0f2f4-234c-4654-99b4-17c3860a5d44 contentSize "20900" assertion.
- 55ecaa21-84d9-42ec-b9bc-167cf49afa09 contentSize "110" assertion.
- 572775af-ed1f-4c29-8aa5-884686ab257d contentSize "24" assertion.
- 6a7dbfa6-fc66-449f-bbe5-2b4d83cbabb3 contentSize "2496" assertion.
- 6b5ce189-93d3-4817-8879-57c83efcadea contentSize "10280" assertion.
- 6f6e8ba4-ce00-45e0-9063-30e54af32626 contentSize "5541" assertion.
- 71bd00c7-2443-4a7c-b15f-63bab9d9e1aa contentSize "488" assertion.
- 7776b24b-a086-4195-9f8f-056836c6a0a0 contentSize "236" assertion.
- 7d719c89-cfba-4eca-8341-ac6daa6b370f contentSize "400" assertion.
- 7e5af759-e546-4f73-b221-18bdb12c38b0 contentSize "1062372" assertion.
- 87226596-ccc0-45e0-a273-9bf48505f929 contentSize "8116" assertion.
- 87333d63-a279-4521-b93e-ba87ad7af3ba contentSize "11043" assertion.
- 8b8059d8-8685-4549-b8ba-45c66abbd09d contentSize "7351" assertion.
- 9456290e-07e1-4286-b1f0-1dd722a7b653 contentSize "172" assertion.
- 9f594136-23cb-4514-a3b0-7ca5193003cf contentSize "54971" assertion.
- a056477a-bbc1-45d0-9ba4-3b8c289a1bb6 contentSize "30231" assertion.
- a9f6e252-5e53-42a4-8a7f-b4a97b1c40cd contentSize "24" assertion.
- abc766e7-ba36-456b-87de-2f17ee5ae845 contentSize "60418600" assertion.
- aef89e9f-9d50-4fde-8228-740059cb5c16 contentSize "2238" assertion.
- b0c9b02f-83cb-410f-8028-cfd13491f0d1 contentSize "30231" assertion.
- b1c57e68-6255-41b8-b621-555a2ae094ae contentSize "7912" assertion.
- b2624fab-49ca-4a8d-aaa5-496ea8a622b2 contentSize "5834" assertion.
- bf79d072-d5db-4eb9-b7fe-602b9bb7243c contentSize "0" assertion.
- c0ff9b95-c133-46a2-8b43-7e9adda5a910 contentSize "13756" assertion.
- c16364f9-cdaf-4f69-8b2d-24ec86e0760a contentSize "121" assertion.
- c16775dc-e7c3-486c-990c-3be2c5ea7a1a contentSize "24" assertion.
- c2131dd6-62cc-4d44-8083-704edbc6a3ce contentSize "3311534" assertion.
- ebc55bb6-993d-4f70-b995-17a4664bb0e9 contentSize "6489975" assertion.
- ed49a599-f8a8-4c12-a05d-3712467babdf contentSize "162" assertion.
- ee6fb1ee-e7de-42d7-82e0-ed5e33bb3c36 contentSize "54971" assertion.
- f02fd4e5-6869-4afc-a651-5e912e99b841 contentSize "22743" assertion.
- dd5c3d62-b632-46a1-99e4-761f2e6cb60d dateCreated "2023-10-20 08:40:49.625147+00:00" assertion.
- 04bdbfbd-b95b-4ec8-9b2b-36bfc698c52b dateCreated "2023-10-20 08:40:50.857721+00:00" assertion.
- 06814c66-d6f5-4d56-8024-1c8d2e58a4db dateCreated "2023-10-20 08:40:50.859933+00:00" assertion.
- 074153e1-485d-47c3-8fd1-9b872f4d9bd3 dateCreated "2023-10-20 08:40:50.969541+00:00" assertion.
- 07e9fdac-98dc-4a1c-9785-d8d311108cde dateCreated "2023-10-20 08:40:50.986923+00:00" assertion.
- 12c2cb86-f0cb-4af4-9573-fe75e143f7e8 dateCreated "2023-10-20 08:40:50.990475+00:00" assertion.
- 13c69a83-de3f-4379-b137-6a12d45bf6e7 dateCreated "2023-10-19 23:56:34+00:00" assertion.
- 15c39d0d-0f43-4d64-8eba-6404356a5adf dateCreated "2023-10-20 08:40:51.046715+00:00" assertion.
- 1a309c23-b22b-4384-acd2-b47eace15095 dateCreated "2023-10-20 08:40:51.464954+00:00" assertion.
- 21dfa77e-a9f7-4ad4-badb-f8dcc1bcc931 dateCreated "2023-10-20 08:40:50.975444+00:00" assertion.
- 24313793-4361-4407-aa07-4df8f0cee2f5 dateCreated "2023-10-20 08:40:50.872283+00:00" assertion.
- 24f02bf4-0975-45cf-9f5a-ec39be7e52c5 dateCreated "2023-10-20 08:40:50.983565+00:00" assertion.
- 3b6d6431-467c-4cb3-8a07-67a1c561d23c dateCreated "2023-10-20 08:40:51.464082+00:00" assertion.
- 51b0f2f4-234c-4654-99b4-17c3860a5d44 dateCreated "2023-10-20 08:40:50.988797+00:00" assertion.
- 55ecaa21-84d9-42ec-b9bc-167cf49afa09 dateCreated "2023-10-20 08:40:50.982729+00:00" assertion.
- 572775af-ed1f-4c29-8aa5-884686ab257d dateCreated "2023-10-20 08:40:50.977115+00:00" assertion.
- 6a7dbfa6-fc66-449f-bbe5-2b4d83cbabb3 dateCreated "2023-10-20 08:40:51.461457+00:00" assertion.
- 6b5ce189-93d3-4817-8879-57c83efcadea dateCreated "2023-10-20 08:40:51.462174+00:00" assertion.
- 6f6e8ba4-ce00-45e0-9063-30e54af32626 dateCreated "2023-10-20 08:40:51.047714+00:00" assertion.
- 71bd00c7-2443-4a7c-b15f-63bab9d9e1aa dateCreated "2023-10-20 08:40:50.986013+00:00" assertion.
- 7776b24b-a086-4195-9f8f-056836c6a0a0 dateCreated "2023-10-20 08:40:51.045900+00:00" assertion.
- 7d719c89-cfba-4eca-8341-ac6daa6b370f dateCreated "2023-10-20 08:40:50.856652+00:00" assertion.
- 7e5af759-e546-4f73-b221-18bdb12c38b0 dateCreated "2023-10-20 08:40:50.884026+00:00" assertion.
- 87226596-ccc0-45e0-a273-9bf48505f929 dateCreated "2023-10-20 08:40:50.991348+00:00" assertion.
- 87333d63-a279-4521-b93e-ba87ad7af3ba dateCreated "2023-10-20 08:40:50.989648+00:00" assertion.
- 8b8059d8-8685-4549-b8ba-45c66abbd09d dateCreated "2023-10-20 08:40:50.981025+00:00" assertion.
- 9456290e-07e1-4286-b1f0-1dd722a7b653 dateCreated "2023-10-20 08:40:50.984377+00:00" assertion.
- 9f594136-23cb-4514-a3b0-7ca5193003cf dateCreated "2023-10-20 08:40:51.048967+00:00" assertion.
- a056477a-bbc1-45d0-9ba4-3b8c289a1bb6 dateCreated "2023-10-20 08:40:50.862139+00:00" assertion.
- a9f6e252-5e53-42a4-8a7f-b4a97b1c40cd dateCreated "2023-10-20 08:40:50.976290+00:00" assertion.
- abc766e7-ba36-456b-87de-2f17ee5ae845 dateCreated "2023-10-20 08:40:51.460476+00:00" assertion.
- aef89e9f-9d50-4fde-8228-740059cb5c16 dateCreated "2023-10-20 08:40:50.858823+00:00" assertion.
- b0c9b02f-83cb-410f-8028-cfd13491f0d1 dateCreated "2023-10-20 08:40:50.971316+00:00" assertion.
- b1c57e68-6255-41b8-b621-555a2ae094ae dateCreated "2023-10-20 08:40:50.981927+00:00" assertion.
- b2624fab-49ca-4a8d-aaa5-496ea8a622b2 dateCreated "2023-10-20 08:40:50.972235+00:00" assertion.
- bf79d072-d5db-4eb9-b7fe-602b9bb7243c dateCreated "2023-10-20 08:40:51.462840+00:00" assertion.
- c0ff9b95-c133-46a2-8b43-7e9adda5a910 dateCreated "2023-10-20 08:40:50.980132+00:00" assertion.
- c16364f9-cdaf-4f69-8b2d-24ec86e0760a dateCreated "2023-10-20 08:40:50.985206+00:00" assertion.
- c16775dc-e7c3-486c-990c-3be2c5ea7a1a dateCreated "2023-10-20 08:40:50.977925+00:00" assertion.
- c2131dd6-62cc-4d44-8083-704edbc6a3ce dateCreated "2023-10-20 08:40:51.044984+00:00" assertion.
- ebc55bb6-993d-4f70-b995-17a4664bb0e9 dateCreated "2023-10-20 08:40:51.016019+00:00" assertion.