Trying to understand the `OUT.novel_vs_known.SQANTI-like.tsv` file
In the OUT.novel_vs_known.SQANTI-like.tsv file, are all the entries for novel transcripts? If so, I'm trying to wrap my head around why some of them are categorized as FSM. Any pointers on this would be helpful, thank you!
Sowmya
@sparthib
Yes, the SQANTI-like output contains only information about novel transcripts. Could you send me an example? It would also be nice to see both the novel and the assigned known transcripts from the GTF file.
Best Andrey
Pasting a few lines from the file here.
transcript372.chr1.nnic chr1 + 2357 11 novel_not_in_catalog ENSG00000187634.13 ENST00000341065.8 2191 12 76 1 76 0 alternative_structure_novel;correct_polya_site_right FALSE True NA NA NA NA NA NA NA False NA NA NA C NA NA NA NA NA 930312 944150 NA 0.05 TCCCGTGTCTACTGCCTCCC NA NA NA NA NA NA NA NA NA
transcript376.chr1.nnic chr1 + 2405 11 novel_not_in_catalog ENSG00000187634.13 ENST00000342066.8 2557 14 395 0 76 0 alternative_structure_novel;terminal_site_match_right_precise;correct_polya_site_right FALSE True NA NA NA NA NA NA NA False NA NA NA C NA NA NA NA NA 925942 944150 NA 0.05 TCCCGTGTCTACTGCCTCCC NA NA NA NA NA NA NA NA NA
transcript450.chr1.nnic chr1 - 2174 11 novel_not_in_catalog ENSG00000279457.4 ENST00000623083.4 1397 10 -493 -297 -493 -297 intron_shift;extra_intron_flanking_right;alternative_polya_site_left FALSE True NA NA NA NA NA NA NA False NA NA NA C NA NA NA NA NA -1 -1 NA 0.30 TATTAAAAGCACACTGTTGG NA NA NA NA NA NA NA NA NA
transcript499.chr1.nic chr1 - 6633 11 novel_in_catalog ENSG00000131591.18 ENST00000421241.7 1832 10 30 0 28 0 alternative_structure_known;terminal_site_match_left_precise;correct_polya_site_left FALSE True NA NA NA NA NA NA NA True NA NA NA C NA NA NA NA NA 1091543 1082896 NA 0.05 AGAGCAGCTCGGAACGCAGC NA NA NA NA NA NA NA NA NA
The file has ~30k lines, whereas the counts file has over 200k lines, so it makes sense that novel_vs_known only categorizes the novel transcripts.
Follow-up question: under the additional info column I see terms that I don't see when I run SQANTI manually on my bambu output. Could you explain how this column works and which subcategories I could observe here? Additionally, as seen above, most of the remaining columns are NA, and I am unsure what these columns correspond to, so a header or a description of the columns would be helpful. Thank you so much @andrewprzh!
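For anyone else reading this file without headers, here is a minimal Python sketch for tallying the structural categories and pulling out the rows labelled as FSM. It rests on assumptions drawn only from the rows pasted above: the file is tab-separated, has no header row, and the sixth column holds the structural category. The label `full-splice_match` is the SQANTI spelling of FSM and is assumed to be what appears here; the function names are hypothetical and not part of IsoQuant.

```python
import csv
from collections import Counter

# Assumption: tab-separated, no header row, and the sixth column (index 5)
# holds the structural category, as in the rows pasted above.
CATEGORY_COLUMN = 5


def count_categories(lines):
    """Tally structural categories from an iterable of TSV lines."""
    counts = Counter()
    for row in csv.reader(lines, delimiter="\t"):
        # Skip blank or truncated rows that lack the category column.
        if len(row) > CATEGORY_COLUMN:
            counts[row[CATEGORY_COLUMN]] += 1
    return counts


def rows_with_category(lines, category="full-splice_match"):
    """Return the rows whose structural category equals `category`.

    The default label is the SQANTI spelling of FSM (an assumption here).
    """
    return [
        row
        for row in csv.reader(lines, delimiter="\t")
        if len(row) > CATEGORY_COLUMN and row[CATEGORY_COLUMN] == category
    ]


# Example usage (the path is whatever your IsoQuant run produced):
#
# with open("OUT.novel_vs_known.SQANTI-like.tsv", newline="") as handle:
#     print(count_categories(handle))
```

If a later release adds a header row, it would be counted as one extra "category", so skip the first line in that case.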
@sparthib
I agree, headers would be nice. I'll add them in the next release. I think I used information from the SQANTI wiki, but it might have changed over time.
Moreover, IsoQuant does not provide the exact SQANTI-like output, thus, a lot of columns are NAs (it would take a lot of time to re-implement all features). If you'd like to have full SQANTI output, it's better to run SQANTI itself :)
> under the additional info column I see terms that I don't see when I run SQANTI manually on my bambu output. Could you explain more how this column works and what are all the possible subcategories I could observe here?
Sorry, which column do you refer to?
Best Andrey
@sparthib
I think I found the reason for the original problem with FSM records. IsoQuant outputs a few novel isoforms that are very similar to the reference ones, which should not be there. I'll make a bug-fix release soon.
@sparthib
This issue should now be resolved as of version 3.5.0.