Problems in gff

Open mpovidlov1 opened this issue 3 years ago • 6 comments

I was looking at the gene annotation files, in particular http://courtyard.gi.ucsc.edu/~mhauknes/T2T/t2t_Y/annotation_set/CHM13.v2.0.gff3. The file appears to contain multiple problems, mostly consecutive exons that touch each other, leaving introns of size 0. I can send examples.

mpovidlov1 avatar Apr 04 '22 22:04 mpovidlov1

@snurk ?

mpovidlov1 avatar Apr 08 '22 18:04 mpovidlov1

The annotations come from Liftoff/CAT, so this is more a question for @mhaukness-ucsc or @diekhans. Are these similar to the issues raised in #31 and #37?

skoren avatar Apr 08 '22 18:04 skoren

Thanks. The other issues describe different problems in earlier versions of the annotation files. Mine is quite specific. The records define consecutive exons like this (start, end): (100, 200) followed by (201, 300),

which means that the intron between them has size 0.
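
To spell out the arithmetic: GFF3 coordinates are 1-based and inclusive, so the intron between two consecutive exons has length `next_start - prev_end - 1`. Below is a minimal sketch of that calculation. The function name `intron_lengths` is made up for illustration and is not part of Liftoff, CAT, or any other tool.

```python
def intron_lengths(exons):
    """Return the intron lengths between consecutive exons.

    `exons` is a list of (start, end) pairs in GFF3-style
    1-based, inclusive coordinates.
    """
    ordered = sorted(exons)
    # With inclusive coordinates, the gap between two exons is
    # next_start - prev_end - 1.
    return [nxt[0] - prev[1] - 1 for prev, nxt in zip(ordered, ordered[1:])]


print(intron_lengths([(100, 200), (201, 300)]))  # [0]
print(intron_lengths([(100, 200), (251, 300)]))  # [50]
```

A result of 0 means the two exons are directly adjacent and could have been written as a single exon.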

mpovidlov1 avatar Apr 08 '22 18:04 mpovidlov1

Hi @mpovidlov1, could you please provide some examples? I think this is likely a result of errors present in the original GENCODE annotations, but I'll look into it.

mhaukness-ucsc avatar Apr 08 '22 23:04 mhaukness-ucsc

Here is an example of the first problematic gene, which starts on line 12:

[problems.txt](https://github.com/marbl/CHM13/files/8455851/problems.txt)
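
| Start | End | Feature | Strand |
|--------|--------|-------------|---|
| 111903 | 112896 | transcript | - |
| 111903 | 112498 | exon | - |
| 111940 | 112498 | CDS | - |
| 111940 | 111942 | stop_codon | - |
| 112499 | 112896 | exon | - |
| 112499 | 112877 | CDS | - |
| 112875 | 112877 | start_codon | - |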

I have attached a list of more than 200 problematic genes, referenced by line number.
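
For anyone who wants to reproduce a list like this, here is a rough sketch of how such records could be found. It is not the script used to produce the attachment. It assumes a standard nine-column, tab-separated GFF3 file in which each exon carries a `Parent=` attribute. The function name `find_touching_exons`, the sequence name `chrTest`, and the IDs `tx1`, `e1`, `e2` are all hypothetical; only the coordinates are taken from the example above.

```python
from collections import defaultdict


def find_touching_exons(gff3_lines):
    """Yield (parent_id, line_no_a, line_no_b) for pairs of exons of the
    same transcript that are directly adjacent (intron of size 0)."""
    exons = defaultdict(list)  # parent ID -> [(start, end, line_no)]
    for line_no, line in enumerate(gff3_lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        # An exon may list several parents separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exons[parent].append((int(cols[3]), int(cols[4]), line_no))
    for parent, records in exons.items():
        records.sort()
        for (_, prev_end, a), (next_start, _, b) in zip(records, records[1:]):
            if next_start == prev_end + 1:  # 1-based inclusive: no gap
                yield parent, a, b


# Synthetic input built from the coordinates in the example above.
sample = [
    "##gff-version 3\n",
    "chrTest\tCAT\ttranscript\t111903\t112896\t.\t-\t.\tID=tx1\n",
    "chrTest\tCAT\texon\t111903\t112498\t.\t-\t.\tID=e1;Parent=tx1\n",
    "chrTest\tCAT\texon\t112499\t112896\t.\t-\t.\tID=e2;Parent=tx1\n",
]
for hit in find_touching_exons(sample):
    print(hit)  # ('tx1', 3, 4)
```

Because the function accepts any iterable of lines, an open file handle for the real GFF3 can be passed in place of `sample`.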

mpovidlov1 avatar Apr 09 '22 00:04 mpovidlov1

Issue moved to CAT repo: https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit/issues/285

diekhans avatar Feb 28 '23 18:02 diekhans