Matches in Nanopublications for { <https://w3id.org/ro-id/7019724e-b5a0-4f7e-a7d6-a1baacac85df/> ?p ?o ?g. }
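The quad pattern in the heading (`<subject> ?p ?o ?g`) can be read as "every predicate, object and graph for this subject". As a minimal sketch, assuming the data is also reachable through some SPARQL service (none is named on this page), the same pattern could be written as a SPARQL query. The function name `quad_pattern_to_sparql` is hypothetical and only builds the query text; it does not contact any server.

```python
def quad_pattern_to_sparql(subject_iri):
    """Build a SPARQL SELECT mirroring the quad pattern { <subject> ?p ?o ?g. }.

    Hypothetical helper for illustration; it only returns a query string.
    """
    # Reject characters that are not allowed inside a SPARQL IRI reference.
    if any(ch in subject_iri for ch in '<>"{}|\\^` '):
        raise ValueError("subject_iri contains characters not allowed in an IRI")
    return (
        "SELECT ?p ?o ?g WHERE {\n"
        f"  GRAPH ?g {{ <{subject_iri}> ?p ?o . }}\n"
        "}"
    )


query = quad_pattern_to_sparql(
    "https://w3id.org/ro-id/7019724e-b5a0-4f7e-a7d6-a1baacac85df/"
)
print(query)
```

Note that `GRAPH ?g` only matches triples stored in named graphs, which fits the listing here, where every match is reported from an `assertion` graph.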
- 7019724e-b5a0-4f7e-a7d6-a1baacac85df type ResearchObject assertion.
- 7019724e-b5a0-4f7e-a7d6-a1baacac85df type LiveRO assertion.
- 7019724e-b5a0-4f7e-a7d6-a1baacac85df type Dataset assertion.
- 7019724e-b5a0-4f7e-a7d6-a1baacac85df mainEntity "Snakefile" assertion.
- 7019724e-b5a0-4f7e-a7d6-a1baacac85df importedBy 0000-0003-2388-0744 assertion.
- 7019724e-b5a0-4f7e-a7d6-a1baacac85df description "# ERGA Protein-coding gene annotation workflow. Adapted from the work of Sagane Joye: https://github.com/sdind/genome_annotation_workflow ## Prerequisites The following programs are required to run the workflow and the listed versions were tested. It should be noted that older versions of snakemake are not compatible with newer versions of singularity as is noted here: [https://github.com/nextflow-io/nextflow/issues/1659](https://github.com/nextflow-io/nextflow/issues/1659). `conda v 23.7.3` `singularity v 3.7.3` `snakemake v 7.32.3` You will also need to acquire a licence key for GeneMark and place this in your home directory with name `~/.gm_key` The key file can be obtained from the following location, where the licence should be read and agreed to: http://topaz.gatech.edu/GeneMark/license_download.cgi ## Workflow The pipeline is based on braker3 and was tested on the following dataset from Drosophila melanogaster: [https://doi.org/10.5281/zenodo.8013373](https://doi.org/10.5281/zenodo.8013373) ### Input data - Reference genome in fasta format - RNAseq data in paired-end zipped fastq format - uniprot fasta sequences in zipped fasta format ### Pipeline steps - **Repeat Model and Mask** Run RepeatModeler using the genome as input, filter any repeats also annotated as protein sequences in the uniprot database and use this filtered library to mask the genome with RepeatMasker - **Map RNAseq data** Trim any remaining adapter sequences and map the trimmed reads to the input genome - **Run gene prediction software** Use the mapped RNAseq reads and the uniprot sequences to create hints for gene prediction using Braker3 on the masked genome - **Evaluate annotation** Run BUSCO to evaluate the completeness of the annotation produced ### Output data - FastQC reports for input RNAseq data before and after adapter trimming - RepeatMasker report containing quantity of masked sequence and distribution among TE families - Protein-coding gene annotation file in gff3 format - BUSCO summary of annotated sequences ## Setup Your data should be placed in the `data` folder, with the reference genome in the folder `data/ref` and the transcript data in the folder `data/rnaseq`. The config file requires the following to be given: ``` asm: 'absolute path to reference fasta' snakemake_dir_path: 'path to snakemake working directory' name: 'name for project, e.g. mHomSap1' RNA_dir: 'absolute path to rnaseq directory' busco_phylum: 'busco database to use for evaluation e.g. mammalia_odb10' ```" assertion.
- 7019724e-b5a0-4f7e-a7d6-a1baacac85df contentSize "91622645" assertion.
- 7019724e-b5a0-4f7e-a7d6-a1baacac85df dateCreated "2023-09-13 18:24:50.860910+00:00" assertion.
- 7019724e-b5a0-4f7e-a7d6-a1baacac85df cite-as "Sagane Joye-Dind. "Research Object Crate for ERGA Protein-coding gene annotation workflow." ROHub. Sep 13, 2023. https://w3id.org/ro-id/7019724e-b5a0-4f7e-a7d6-a1baacac85df." assertion.
- 7019724e-b5a0-4f7e-a7d6-a1baacac85df url "https://workflowhub.eu/workflows/569/ro_crate?version=1" assertion.
- 7019724e-b5a0-4f7e-a7d6-a1baacac85df name "Research Object Crate for ERGA Protein-coding gene annotation workflow" assertion.
- 7019724e-b5a0-4f7e-a7d6-a1baacac85df contentUrl "https://api.rohub.org/api/ros/7019724e-b5a0-4f7e-a7d6-a1baacac85df/crate/download/" assertion.
- 7019724e-b5a0-4f7e-a7d6-a1baacac85df creator 0000-0003-2388-0744 assertion.
- 7019724e-b5a0-4f7e-a7d6-a1baacac85df dateModified "2024-03-05 12:23:14.184651+00:00" assertion.
- 7019724e-b5a0-4f7e-a7d6-a1baacac85df datePublished "2023-09-13 18:24:50.860910+00:00" assertion.
- 7019724e-b5a0-4f7e-a7d6-a1baacac85df encodingFormat "application/ld+json" assertion.
- 7019724e-b5a0-4f7e-a7d6-a1baacac85df hasPart 17b98cad-68c7-4991-acb1-b922f0c3d44f assertion.
- 7019724e-b5a0-4f7e-a7d6-a1baacac85df hasPart 28d4cc6a-cc26-4629-865c-951dcec04a63 assertion.
- 7019724e-b5a0-4f7e-a7d6-a1baacac85df hasPart 72f47650-7679-4648-bb43-d28fa77665c4 assertion.
- 7019724e-b5a0-4f7e-a7d6-a1baacac85df hasPart ef0d1b26-8b1c-4a45-9758-52610cb8e3a8 assertion.
- 7019724e-b5a0-4f7e-a7d6-a1baacac85df hasPart 043ddb69-1a1c-4eb4-8785-9e3803fa377c assertion.
- 7019724e-b5a0-4f7e-a7d6-a1baacac85df hasPart 11f3e069-8d1d-48da-876f-52fd6d255223 assertion.
- 7019724e-b5a0-4f7e-a7d6-a1baacac85df hasPart 89e1f03d-e2e9-46f1-8386-8cdeff35386c assertion.
- 7019724e-b5a0-4f7e-a7d6-a1baacac85df identifier "https://w3id.org/ro-id/7019724e-b5a0-4f7e-a7d6-a1baacac85df" assertion.
- 7019724e-b5a0-4f7e-a7d6-a1baacac85df license no-permission assertion.
- 7019724e-b5a0-4f7e-a7d6-a1baacac85df isBasedOn "https://github.com/ERGA-consortium/pipelines/tree/main/annotation/snakemake" assertion.
- 7019724e-b5a0-4f7e-a7d6-a1baacac85df author 0000-0003-4771-6113 assertion.
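
The `description` value above lists five keys that the workflow's config file must provide, two of which are described as absolute paths. As a minimal sketch, assuming the config has already been loaded into a plain dictionary, a pre-flight check might look like the following. The function name `check_config` and the example paths are hypothetical and are not part of the workflow itself; paths are checked with POSIX rules on the assumption that the workflow runs on a Linux-like system.

```python
from posixpath import isabs  # assumption: the workflow runs on a POSIX system

# Keys the workflow description says the config file must provide.
REQUIRED_KEYS = ("asm", "snakemake_dir_path", "name", "RNA_dir", "busco_phylum")
# Keys the description explicitly calls absolute paths.
ABSOLUTE_PATH_KEYS = ("asm", "RNA_dir")


def check_config(config):
    """Return a list of problems found in the config; an empty list means none."""
    problems = []
    for key in REQUIRED_KEYS:
        value = config.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty value for '{key}'")
    for key in ABSOLUTE_PATH_KEYS:
        value = config.get(key)
        if isinstance(value, str) and value.strip() and not isabs(value):
            problems.append(f"'{key}' should be an absolute path, got '{value}'")
    return problems


# Hypothetical example values, following the placeholders in the description.
example = {
    "asm": "/project/data/ref/genome.fasta",
    "snakemake_dir_path": "/project",
    "name": "mHomSap1",
    "RNA_dir": "/project/data/rnaseq",
    "busco_phylum": "mammalia_odb10",
}
print(check_config(example))
```

The check only looks at the shape of the values; it does not verify that the files exist or that the BUSCO database name is valid.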