should extended_annotation.gtf be a superset of the input gtf?

Open jamestwebber opened this issue 1 year ago • 10 comments

This is what I assumed should happen, but it doesn't appear to be the case: my reference GTF has ~61k genes (GRCh38, gencode v39) but the output extended_annotation.gtf does not include all the known genes and transcripts (by a large margin: 23k genes). Is there some filtering going on here?

jamestwebber avatar Apr 17 '24 18:04 jamestwebber
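One way to check the superset expectation directly is to compare the `gene_id` sets of the two files. The following is a minimal sketch, not part of IsoQuant: the helper names `gene_ids` and `missing_genes` are made up for illustration, the regular expression assumes GENCODE-style quoted `gene_id` attributes, and the second record (`TOY_GENE.1`) is a toy placeholder rather than real annotation data.

```python
import re

# Assumes GENCODE-style attributes in column 9, e.g. gene_id "ENSG...";
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def gene_ids(gtf_lines):
    """Collect the gene_id values found in an iterable of GTF lines."""
    ids = set()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        match = GENE_ID_RE.search(fields[8])
        if match:
            ids.add(match.group(1))
    return ids


def missing_genes(reference_lines, output_lines):
    """Return gene_ids present in the reference but absent from the output."""
    return gene_ids(reference_lines) - gene_ids(output_lines)


# Tiny in-memory example; the second reference record is a toy placeholder.
reference = [
    'chr1\tHAVANA\tgene\t14404\t29570\t.\t-\t.\tgene_id "ENSG00000227232.5"; gene_name "WASH7P";',
    'chr1\tHAVANA\tgene\t100\t200\t.\t+\t.\tgene_id "TOY_GENE.1"; gene_name "TOY";',
]
output = reference[:1]

missing = missing_genes(reference, output)
print(sorted(missing))
```

For real files, the same functions accept open file handles, e.g. `missing_genes(open(ref_path), open(out_path))`, where both paths are whatever the reference GTF and `extended_annotation.gtf` are called locally.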

Hi @jamestwebber

Yes, this is a known flaw in the current version. It is now fixed, and the fix will be out in 3.4 (hopefully soon).

Best Andrey

andrewprzh avatar Apr 18 '24 09:04 andrewprzh

Should be fixed now in IsoQuant 3.4

andrewprzh avatar May 09 '24 09:05 andrewprzh

I thought this was fixed, but I'm seeing some instances where the exon information for a gene was not copied over. I wonder if this is related to whether or not reads were assigned to the gene.

jamestwebber avatar Sep 16 '24 19:09 jamestwebber

I noticed this initially in an unprocessed pseudogene (WASH7P) just because it happens to be very close to the beginning of chr1. So if there's any filtering based on biotype, that could also be involved.

jamestwebber avatar Sep 16 '24 19:09 jamestwebber

@jamestwebber

There should not be any additional filtering, so this sounds odd. What kind of information is missing? Is it exon records? Would it be possible to take a look at this example?

Thanks Andrey

andrewprzh avatar Sep 19 '24 22:09 andrewprzh

Ah! This is probably a false alarm: it looks like the transcript name was not copied over, but the exons themselves are present. I was looking for the gene name and didn't see the exons. For example, the first exon in both files:

$ grep 'ENST00000488147.1' ~/reference/GRCh38.gencode.v39.annotation.basic.gtf | head -n 2 
chr1    HAVANA  transcript      14404   29570   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-201"; level 2; transcript_support_level "NA"; hgnc_id "HGNC:38034"; ont "PGO:0000005"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1    HAVANA  exon    29534   29570   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-201"; exon_number 1; exon_id "ENSE00001890219.1"; level 2; transcript_support_level "NA"; hgnc_id "HGNC:38034"; ont "PGO:0000005"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
$ grep 'ENST00000488147.1' OUT.extended_annotation.gtf | head -n 2
chr1    HAVANA  transcript      14404   29570   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; exons "11"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-201"; transcript_support_level "NA"; hgnc_id "HGNC:38034"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1"; 
chr1    HAVANA  exon    29534   29570   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; exon "1"; exon_id "chr1.40908";

jamestwebber avatar Sep 19 '24 22:09 jamestwebber
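The difference between the two exon records above can be made explicit by parsing the attribute column and listing the keys that are present in the reference but not in the output. This is only a sketch: `parse_attributes` and `missing_keys` are hypothetical helpers, not IsoQuant functions, and the regular expression assumes the `key "value";` / `key value;` layout seen in the records quoted in this thread.

```python
import re

# Matches `key "quoted value";` or `key bare_value;` pairs in GTF column 9.
ATTR_RE = re.compile(r'(\S+) (?:"([^"]*)"|(\S+?));')


def parse_attributes(column9):
    """Return {key: [values]}; a list is used because keys such as `tag` repeat."""
    attrs = {}
    for key, quoted, bare in ATTR_RE.findall(column9):
        attrs.setdefault(key, []).append(quoted if quoted else bare)
    return attrs


def missing_keys(reference_column9, output_column9):
    """Return the sorted attribute keys found only in the reference record."""
    ref_keys = set(parse_attributes(reference_column9))
    out_keys = set(parse_attributes(output_column9))
    return sorted(ref_keys - out_keys)


# Column 9 of the first exon record in each file, copied from the thread.
ref_exon = (
    'gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; '
    'gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; '
    'transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-201"; '
    'exon_number 1; exon_id "ENSE00001890219.1"; level 2; '
    'transcript_support_level "NA"; hgnc_id "HGNC:38034"; ont "PGO:0000005"; '
    'tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTHUMG00000000958.1"; '
    'havana_transcript "OTTHUMT00000002839.1";'
)
out_exon = (
    'gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; '
    'exon "1"; exon_id "chr1.40908";'
)

missing = missing_keys(ref_exon, out_exon)
print(missing)
```

The printed list includes `gene_name` and `transcript_name`, while `gene_id` and `transcript_id` are shared by both records, which matches the observation that the exons are present but their descriptive attributes are not.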

Yes, additional information such as gene names is only copied for gene and transcript records. I can do the same for exons if needed.

andrewprzh avatar Sep 20 '24 09:09 andrewprzh

The reason I noticed this is that I was looking at IGV, and it wasn't displaying the exons for WASH7P, only the gene body. I think this is really a bug in how IGV is parsing the GTF (it should be matching on transcript_id), but you will probably update sooner. 😂

jamestwebber avatar Sep 20 '24 14:09 jamestwebber

Yeah, I thought transcript_id would be enough. Maybe converting to GFF3 and having ID and Parent attributes instead will make it work.

Anyway, will fix exon information.

andrewprzh avatar Sep 20 '24 15:09 andrewprzh

Exon attributes should now be copied from the reference in IsoQuant 3.6.1.

andrewprzh avatar Sep 25 '24 13:09 andrewprzh