usegalaxy-playbook gtf sniffed as gff

Example files, uploaded via URL or browser:

https://zenodo.org/record/290221/files/dexseq.gtf
https://zenodo.org/record/290221/files/Drosophila_melanogaster.BDGP5.78.gtf

Test history:

https://usegalaxy.org/u/jen/h/gtf-sniffed-as-gff

Tutorial:

https://galaxyproject.github.io/training-material/topics/transcriptomics/tutorials/ref-based/tutorial.html

Data Source:

https://zenodo.org/record/290221

Interestingly, gff3 is now accurately sniffed as gff3 (and not gff), also included in history for reference. This seems new but may not actually be new.

Tutorial:

https://galaxyproject.github.io/training-material/topics/transcriptomics/tutorials/ref-based/tutorial.html

Data Source:

https://zenodo.org/record/582600

ping @guerler

Oct 03 '17 02:10 jennaj

Is this Main-specific? It seems like an issue with the datatypes conf sniffer order or the sniffers themselves. Main has been using the sample datatypes_conf.xml since 5187d5de7e15e6cb5f375f34e9473c606e6973e8

Feb 23 '18 15:02 natefoo

Reran the tests, same result.

The problem is that the format of some datasets are a hybrid of GTF/GFF3 formatting - created by the DEXseq prepare annotation tool or manipulated datasets from other sources.

Line examples with mix of GFF/GFF3 and GTF style annotation in 9th group/attribute field:

Seqname Source Feature Start End Score Strand Frame Group

2L dexseq_prepare_annotation.py aggregate_gene 378112 387439 . + . gene_id "FBgn0000061"

2L dexseq_prepare_annotation.py exonic_part 378112 378481 . + . transcripts "FBtr0078053"; exonic_part_number "001"; gene_id "FBgn0000061"

2L dexseq_prepare_annotation.py exonic_part 384511 384894 . + . transcripts "FBtr0078053"; exonic_part_number "002"; gene_id "FBgn0000061"

2L dexseq_prepare_annotation.py exonic_part 385701 385746 . + . transcripts "FBtr0078053"; exonic_part_number "003"; gene_id "FBgn0000061"

2L dexseq_prepare_annotation.py exonic_part 386308 386576 . + . transcripts "FBtr0078053"; exonic_part_number "004"; gene_id "FBgn0000061"

2L dexseq_prepare_annotation.py exonic_part 386703 387439 . + . transcripts "FBtr0078053"; exonic_part_number "005"; gene_id "FBgn0000061"

Seqname	Source	Feature	Start	End	Score	Strand	Frame	Group
2L	dexseq_prepare_annotation.py	aggregate_gene	378112	387439	.	+	.	gene_id "FBgn0000061"
2L	dexseq_prepare_annotation.py	exonic_part	378112	378481	.	+	.	transcripts "FBtr0078053"; exonic_part_number "001"; gene_id "FBgn0000061"
2L	dexseq_prepare_annotation.py	exonic_part	384511	384894	.	+	.	transcripts "FBtr0078053"; exonic_part_number "002"; gene_id "FBgn0000061"
2L	dexseq_prepare_annotation.py	exonic_part	385701	385746	.	+	.	transcripts "FBtr0078053"; exonic_part_number "003"; gene_id "FBgn0000061"
2L	dexseq_prepare_annotation.py	exonic_part	386308	386576	.	+	.	transcripts "FBtr0078053"; exonic_part_number "004"; gene_id "FBgn0000061"
2L	dexseq_prepare_annotation.py	exonic_part	386703	387439	.	+	.	transcripts "FBtr0078053"; exonic_part_number "005"; gene_id "FBgn0000061"

Seqname Source Feature Start End Score Strand Frame Group

chrUextra FlyBase gene 523024 523086 . - . gene_id "FBgn0264003"; gene_version "5"; gene_name "mir-5613"; gene_source "FlyBase"; gene_biotype "pre_miRNA";

chrUextra FlyBase transcript 523024 523086 . - . gene_id "FBgn0264003"; gene_version "5"; transcript_id "FBtr0330361"; transcript_version "5"; gene_name "mir-5613"; gene_source "FlyBase"; gene_biotype "pre_miRNA"; transcript_name "mir-5613-RM"; transcript_source "FlyBase"; transcript_biotype "pre_miRNA";

chrUextra FlyBase exon 523024 523086 . - . gene_id "FBgn0264003"; gene_version "5"; transcript_id "FBtr0330361"; transcript_version "5"; exon_number "1"; gene_name "mir-5613"; gene_source "FlyBase"; gene_biotype "pre_miRNA"; transcript_name "mir-5613-RM"; transcript_source "FlyBase"; transcript_biotype "pre_miRNA"; exon_id "FBtr0330361:1"; exon_version "1";

chrUextra FlyBase transcript 523024 523048 . - . gene_id "FBgn0264003"; gene_version "5"; transcript_id "FBtr0330363"; transcript_version "5"; gene_name "mir-5613"; gene_source "FlyBase"; gene_biotype "pre_miRNA"; transcript_name "mir-5613-RB"; transcript_source "FlyBase"; transcript_biotype "miRNA";

chrUextra FlyBase exon 523024 523048 . - . gene_id "FBgn0264003"; gene_version "5"; transcript_id "FBtr0330363"; transcript_version "5"; exon_number "1"; gene_name "mir-5613"; gene_source "FlyBase"; gene_biotype "pre_miRNA"; transcript_name "mir-5613-RB"; transcript_source "FlyBase"; transcript_biotype "miRNA"; exon_id "FBtr0330363:1"; exon_version "1";

Seqname	Source	Feature	Start	End	Score	Strand	Frame	Group
chrUextra	FlyBase	gene	523024	523086	.	-	.	gene_id "FBgn0264003"; gene_version "5"; gene_name "mir-5613"; gene_source "FlyBase"; gene_biotype "pre_miRNA";
chrUextra	FlyBase	transcript	523024	523086	.	-	.	gene_id "FBgn0264003"; gene_version "5"; transcript_id "FBtr0330361"; transcript_version "5"; gene_name "mir-5613"; gene_source "FlyBase"; gene_biotype "pre_miRNA"; transcript_name "mir-5613-RM"; transcript_source "FlyBase"; transcript_biotype "pre_miRNA";
chrUextra	FlyBase	exon	523024	523086	.	-	.	gene_id "FBgn0264003"; gene_version "5"; transcript_id "FBtr0330361"; transcript_version "5"; exon_number "1"; gene_name "mir-5613"; gene_source "FlyBase"; gene_biotype "pre_miRNA"; transcript_name "mir-5613-RM"; transcript_source "FlyBase"; transcript_biotype "pre_miRNA"; exon_id "FBtr0330361:1"; exon_version "1";
chrUextra	FlyBase	transcript	523024	523048	.	-	.	gene_id "FBgn0264003"; gene_version "5"; transcript_id "FBtr0330363"; transcript_version "5"; gene_name "mir-5613"; gene_source "FlyBase"; gene_biotype "pre_miRNA"; transcript_name "mir-5613-RB"; transcript_source "FlyBase"; transcript_biotype "miRNA";
chrUextra	FlyBase	exon	523024	523048	.	-	.	gene_id "FBgn0264003"; gene_version "5"; transcript_id "FBtr0330363"; transcript_version "5"; exon_number "1"; gene_name "mir-5613"; gene_source "FlyBase"; gene_biotype "pre_miRNA"; transcript_name "mir-5613-RB"; transcript_source "FlyBase"; transcript_biotype "miRNA"; exon_id "FBtr0330363:1"; exon_version "1";

There is not much that can be done - since it doesn't fit either GTF or GFF3 format (for a few reasons), the sniffer is falling back to the GFF format, which is the original with a looser specification - can put almost anything in the 9th "group/attribute" column and have what is technically a valid-ish GFF file for some use-cases.

Let's close this ticket out. If these datasets are only used with the intended tools, the assigned format is Ok. And is probably one reason why tools that consume these inputs are vague or combine "GFF3/GFF/GTF" format in the help text areas and associated tutorials. Some wrapped 3rd party tools produce custom hybrid datasets that are tool-group specific but close enough to other known datatypes that they are sniffed/assigned to them. This is how bioinformatics works in real life! :)

The only change that could be made is to rename the dataset at the source (especially if associated with our tutorials or publications) to have the .gff extension instead of gtf so that no one getting the data out of context thinks it is in GTF format. I'll open a ticket for this in training repo.

Feb 23 '18 18:02 jennaj

related to https://github.com/galaxyproject/galaxy/issues/2298

May 23 '18 21:05 jennaj

Ran into this issue (a GTF file sniffed as GFF) while testing the current ref-based RNA-seq tutorial using the GTF file provided in Zenodo (here: https://zenodo.org/record/1185122/files/Drosophila_melanogaster.BDGP6.87.gtf) in usegalaxy.eu.

And if I try with specifying the Type as "GTF" during upload I get a warning "Warning: The file 'Type' was set to 'gtf' but the file does not appear to be of that type" in the dataset in the history, see below.

Not sure where the Zenodo GTF file comes from, think Flybase or Ensembl.

Jul 02 '18 00:07 mblue9

@jennaj could you reopen this issue when you get a chance please as I'm seeing it with the current Ensembl GTF files e.g. Fly: ftp://ftp.ensembl.org/pub/release-92/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.92.gtf.gz Human: ftp://ftp.ensembl.org/pub/release-92/gtf/homo_sapiens/Homo_sapiens.GRCh38.92.gtf.gz

The Ensembl GTFs are auto-importing as GFF, and if GTF is set as the Type on import, the warning appears that the file doesn't appear to be GTF. GTF is equivalent to GFF2 format but GTF format should take precedence here (and there shouldn't be a warning output saying this doesn't appear GTF format, it is GTF) imo. Is it a matter of reordering the GTF/GFF datatypes priority somewhere?

Jul 06 '18 05:07 mblue9

The first line of that file - parsed as:

[u'chr3R', u'FlyBase', u'gene', u'722370', u'722621', u'.', u'-', u'.', u'gene_id "FBgn0085804"; gene_name "CR41571"; gene_source "FlyBase"; gene_biotype "pseudogene";']

does not have a transcript_id so Galaxy does not sniff it as a gtf - https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/datatypes/interval.py#L1059.

May 07 '19 15:05 jmchilton

http://genome.ucsc.edu/FAQ/FAQformat#format4

The attribute list must begin with the two mandatory attributes:

gene_id value - A globally unique identifier for the genomic source of the sequence. transcript_id value - A globally unique identifier for the predicted transcript.

May 07 '19 15:05 jmchilton

Rats. @mblue9 what do you think?

All these custom formats for GFF/GFF2/GTF/GFF3 are ... problematic

These data look like GFF3 that was modified to be in GTF format, but only some lines actually meet with the spec for GTF. Data also includes headers that do not fit GTF (should be none) or GFF3 (which wouldn't fit the content).

Loaded both into the history here: https://usegalaxy.org:/u/jen/h/test-gtf-downloads

What tools actually require can vary. Content and format both factors. We need to figure out a way to handle these somehow. Avoid the wrong datatype for format reasons (unexpected headers) and base it just on content. Then tools need to know how to handle that input based on content/datatype (ignore headers or we strip them out ourselves at Upload).

Or we just keep on defaulting back to a generic GFF and have users clean up the data. Which users have problems with. I don't know the right path for this data source. Some tools will work with it fine, some will not.

May 07 '19 17:05 jennaj

usegalaxy-playbook usegalaxy-playbook copied to clipboard

gtf sniffed as gff

usegalaxy-playbook
usegalaxy-playbook copied to clipboard