bcftools icon indicating copy to clipboard operation
bcftools copied to clipboard

bcftools csq - GFF Format

Open fbemm opened this issue 4 years ago • 9 comments

Hi,

could someone explain me why the feature type for a line in a GFF is not taken from the third GFF filed but bcftools csq expect each gene and transcript with a prefix (e.g., gene: or transcript:)? Inflates GFFs pretty much with redundant information and introduces IDs that are longer than they actually have to be. Guess there is a rational that I just don't get at the moment.

Thx, Felix

Screenshot from 2019-09-03 15-09-07

fbemm avatar Sep 03 '19 13:09 fbemm

GFFs provided by Ensembl use this convention ftp://ftp.ensembl.org/pub/current_gff3/homo_sapiens/

pd3 avatar Sep 03 '19 13:09 pd3

That I know but almost no other annotation (tool) does produce such a format. Plus, putting the prefixes there is redundant with the feature column of the GFF format specification. The feature column can probably considered more stable in its definition than a prefix of the attribute ID field. I can provide a patch if this is considered a better way of identifying GFF_TSCRIPT_LINE and GFF_GENE_LINE.

fbemm avatar Sep 03 '19 14:09 fbemm

I am open for this to be changed as long as it continues working with Ensembl files. A more general (and also an easier) solution might be to provide a new script gff2gff to convert between the various flavors of GFF files.

pd3 avatar Sep 03 '19 14:09 pd3

The latter might be a short term fix but one has to remove gene: or transcript: from the annotated VCF in the end again or has to live with the inflated IDs. I just checked the human GFF3:

Testing for a GFF_GENE_LINE would need to be TRUE if the third field of an (Ensembl) GFF contains:

  • gene
  • ncRNA_gene
  • pseudogene

grep "ID=gene:" Homo_sapiens.GRCh38.97.chromosome.1.gff3 | cut -f3 | sort | uniq -c

Testing for a GFF_TSCRIPT_LINE would need to be TRUE if the third field of an (Ensembl) GFF contains:

  • lnc_RNA
  • miRNA
  • mRNA
  • ncRNA
  • pseudogenic_transcript
  • rRNA
  • scRNA
  • snoRNA
  • snRNA
  • unconfirmed_transcript

Best to stay in sync with SequenceOntology (which Ensembl promotes):

Gene --> http://www.sequenceontology.org/browser/current_release/term/SO:0000704 Transcript --> http://www.sequenceontology.org/browser/current_release/term/SO:0000673

fbemm avatar Sep 03 '19 14:09 fbemm

Hi, would be nice if the prefix ("transcript:", "gene:", etc) were optional. This is causing problems in otherwise reasonable GFFs, e.g. : marbl/CHM13#31

brentp avatar Sep 06 '21 06:09 brentp

There are too many possible variations a GFF can have, I don't want to burden bcftools csq with that complexity. I will accept a pull request that extends the https://github.com/samtools/bcftools/blob/develop/misc/gff2gff.py script and adds the prefixes when missing.

pd3 avatar Sep 07 '21 14:09 pd3

Petr, thanks for the reply. I'll look into making a PR for the GFF, the latest error is:

Error: GFF3 assumption failed for transcript CHM13_T0000003, CDS=111940: phase!=len%3 (phase=2, len=379). Use the --force option to proceed anyway (at your own risk).

do you have a recommendation for this? There are no phase annotations in the GFF. thanks for any ideas. -Brent

brentp avatar Sep 15 '21 10:09 brentp

Petr, thanks for the reply. I'll look into making a PR for the GFF, the latest error is:

Error: GFF3 assumption failed for transcript CHM13_T0000003, CDS=111940: phase!=len%3 (phase=2, len=379). Use the --force option to proceed anyway (at your own risk).

do you have a recommendation for this? There are no phase annotations in the GFF. thanks for any ideas. -Brent

@brentp Hi, any updates on this? I'm experiencing the same issue .

Best, DK

dKlee99 avatar Sep 17 '21 04:09 dKlee99

Phase is 8th column of GFF https://www.ncbi.nlm.nih.gov/datasets/docs/reference-docs/file-formats/about-ncbi-gff3. The program detected some inconsistency between expected and observed phase (frame).

pd3 avatar Sep 17 '21 09:09 pd3