Inconsistency between transcript_tpm.tsv and transcript_model_tpm.tsv
Hi, thank you for developing this useful tool!
I would like to inquire about the differences between transcript_tpm.tsv and transcript_model_tpm.tsv. The file transcript_model_tpm.tsv contains the expression of discovered transcript models in TPM (and corresponds to transcript_models.gtf). It should include all expressed transcripts, both novel and known, correct? However, when I search for a specific transcript, such as the canonical transcript ENST00000275493, I find it only in transcript_tpm.tsv, with a high TPM value, while it is absent from transcript_model_tpm.tsv. I understand that these are two different algorithms, reference-based and discovery, but it is a bit odd that ENST00000275493 is completely absent from transcript_model_tpm.tsv. Is there a reason for this inconsistency?
Similarly, ENST00000275493 is absent from transcript_models.gtf. So, would you recommend using transcript_models.gtf or extended_annotation.gtf for downstream analyses, such as SQANTI3?
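For context, I compared the two tables roughly as in this sketch (the paths and the assumption that transcript IDs sit in the first column of each file are mine; adjust to your actual output):

import pandas as pd

# Load both quantification tables; transcript IDs are assumed to be in the
# first column of each file.
ref_based = pd.read_csv("OUT.transcript_tpm.tsv", sep="\t")
discovered = pd.read_csv("OUT.transcript_model_tpm.tsv", sep="\t")

# Known transcripts quantified by the reference-based algorithm but absent
# from the discovered transcript models.
missing = set(ref_based.iloc[:, 0]) - set(discovered.iloc[:, 0])
print(len(missing), "transcripts only in transcript_tpm.tsv")
print("ENST00000275493 absent from models:", "ENST00000275493" in missing)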
Thank you!
Dear @Mangosteen24
Thank you for the feedback!
Your understanding is correct, and this is indeed a little bit odd. To understand where this inconsistency stems from, one has to dig deeper into the algorithms and the data.
A few questions. Which version do you use? Some of these inconsistencies were fixed at some point, but I cannot guarantee all of them are eliminated.
What reads are assigned to these isoforms in .read_assignments.tsv.gz and what are their assignment types? Where do these reads go in .transcript_model_reads.tsv.gz?
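Something along these lines should work for tallying (a minimal sketch; the column positions are assumptions based on the commented header line, so check your file first):

import pandas as pd

# Skip the commented header; columns are assumed to be read_id, chr, strand,
# isoform_id, gene_id, assignment_type, ... (verify against the "#" line).
ra = pd.read_csv("OUT.read_assignments.tsv.gz", sep="\t", comment="#", header=None)
hits = ra[ra[3] == "ENST00000275493"]    # column 3: isoform_id (assumed)
print(hits[5].value_counts())            # column 5: assignment_type (assumed)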
Best Andrey
Hi Andrey, I use the latest version of IsoQuant, v3.6.1.
First, I checked which reads were assigned to ENST00000275493 and their assignment_type in OUT.read_assignments.tsv.gz:
1165 ambiguous
659 inconsistent
13394 inconsistent_ambiguous
20156 inconsistent_non_intronic
10032 unique
19 unique_minor_difference
Then I checked those 10032 unique read IDs in OUT.transcript_model_reads.tsv.gz:
9049 *
219 ENST00000344576
28 ENST00000450046
3 ENST00000485503
236 transcript13328.chr7.nic
56 transcript13334.chr7.nnic
21 transcript13336.chr7.nic
108 transcript13359.chr7.nic
193 transcript13376.chr7.nic
101 transcript13394.chr7.nic
3 transcript13436.chr7.nic
3 transcript13455.chr7.nic
1 transcript13579.chr7.nnic
804 transcript13776.chr7.nnic
9 transcript13842.chr7.nnic
621 transcript13858.chr7.nnic
142 transcript13993.chr7.nnic
3 transcript13995.chr7.nic
It seems that most unique reads were assigned to * instead of ENST00000275493, meaning they were not attributed to any transcript model, neither known nor NIC/NNIC? I also noticed that most of the unique reads' assignment events were 'mono_exonic'; maybe that's the reason it cannot differentiate which isoform they come from?
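For reproducibility, the counts above were produced roughly like this (a sketch; the column positions are assumed as in the snippet earlier in the thread):

import pandas as pd

ra = pd.read_csv("OUT.read_assignments.tsv.gz", sep="\t", comment="#", header=None)
# Columns assumed: 0 = read_id, 3 = isoform_id, 5 = assignment_type.
unique_ids = set(ra[(ra[3] == "ENST00000275493") & (ra[5] == "unique")][0])

# transcript_model_reads is assumed to map read_id (column 0) to the
# assigned transcript model (column 1).
tmr = pd.read_csv("OUT.transcript_model_reads.tsv.gz", sep="\t", comment="#", header=None)
print(tmr[tmr[0].isin(unique_ids)][1].value_counts())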
Thank you for providing the insights!
I think that's what happens here.
These reads map to some unique parts of the ENST00000275493 isoform (i.e. some exons that are only present in this transcript), and thus are uniquely assigned. However, to report a transcript model during transcript discovery, IsoQuant needs full splice matching reads. I suspect this transcript was not reported precisely because there are none. Thus, all these reads were either not assigned to any isoform (*) or got assigned to something else.
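If you want to double-check, you can tabulate the assignment events of these reads; spliced full matches are what discovery needs (a sketch with the same assumed column layout as above):

import pandas as pd

ra = pd.read_csv("OUT.read_assignments.tsv.gz", sep="\t", comment="#", header=None)
hits = ra[ra[3] == "ENST00000275493"]    # column 3: isoform_id (assumed)
# Column 6 (assignment_events, assumed) lists the matching events; if no
# spliced full-match events appear, no model will be reported.
print(hits[6].value_counts().head(20))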
I can take a further look if you have a chance to send me a BAM file with all the reads from this region, or the reads assigned to ENST00000275493 (all types).
Best Andrey
Thank you @andrewprzh. I have a follow-up question. Since I have a huge dataset, around 40 BAM files, I want to run them simultaneously to obtain a single annotation file; this way, each sample will have the same IDs for NIC/NNIC transcripts for comparison. However, the process seems to be stuck after running for 24 hours; in the log file, the last line is from a few days ago: '2024-10-26 - INFO - Finished processing chromosome chr10'. Here is my command. Could you kindly suggest a solution?
isoquant.py --data_type nanopore --yaml /home/dataset.yaml --complete_genedb --genedb /home/refdata-gex-GRCh38-2020-A/genes/genes.gtf --reference /home/refdata-gex-GRCh38-2020-A/fasta/genome.fa --output /home/isoquant_exp --read_group tag:CB --sqanti_output --threads 32
dataset.yaml
[
data format: "bam",
{
name: "exp",
long read files: [
"/home/tagged1.bam",
"/home/tagged2.bam",
"/home/tagged3.bam",
"/home/tagged4.bam",
"/home/tagged5.bam",
"/home/tagged6.bam",
"/home/tagged7.bam",
"/home/tagged8.bam",
"/home/tagged9.bam",
"/home/tagged10.bam",
"/home/tagged11.bam",
"/home/tagged12.bam",
"/home/tagged13.bam",
"/home/tagged14.bam",
"/home/tagged15.bam",
"/home/tagged16.bam",
"/home/tagged17.bam",
"/home/tagged18.bam",
"/home/tagged19.bam",
"/home/tagged20.bam",
"/home/tagged21.bam",
"/home/tagged22.bam",
"/home/tagged23.bam",
"/home/tagged24.bam",
"/home/tagged25.bam",
"/home/tagged26.bam",
"/home/tagged27.bam",
"/home/tagged28.bam",
"/home/tagged29.bam",
"/home/tagged30.bam",
"/home/tagged31.bam",
"/home/tagged32.bam",
"/home/tagged33.bam",
"/home/tagged34.bam",
"/home/tagged35.bam",
"/home/tagged36.bam",
"/home/tagged37.bam"
],
labels: [
"Replicate1",
"Replicate2",
"Replicate3",
"Replicate4",
"Replicate5",
"Replicate6",
"Replicate7",
"Replicate8",
"Replicate9",
"Replicate10",
"Replicate11",
"Replicate12",
"Replicate13",
"Replicate14",
"Replicate15",
"Replicate16",
"Replicate17",
"Replicate18",
"Replicate19",
"Replicate20",
"Replicate21",
"Replicate22",
"Replicate23",
"Replicate24",
"Replicate25",
"Replicate26",
"Replicate27",
"Replicate28",
"Replicate29",
"Replicate30",
"Replicate31",
"Replicate32",
"Replicate33",
"Replicate34",
"Replicate35",
"Replicate36",
"Replicate37"
]
}
]
@Mangosteen24
Could you send me the entire log file? Sometimes, reading many BAM files in parallel may result in slower performance; merging the BAMs into a single file and using tags or a TSV file to group counts may be a faster solution.
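For the merging, something like this sketch would do (samtools is assumed to be on PATH; file names are placeholders):

import subprocess

bams = [f"/home/tagged{i}.bam" for i in range(1, 38)]
# Merge all per-sample BAMs into one and index the result.
subprocess.run(["samtools", "merge", "-@", "16", "merged.bam", *bams], check=True)
subprocess.run(["samtools", "index", "merged.bam"], check=True)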
Best Andrey
I tried again with 27 samples; however, the server shut down unexpectedly. At that point, most chromosomes seemed to have finished processing except for four: chr1, chr7, chr11, and chrM. Details are shown in the attached log1.
isoquant.py --data_type nanopore --yaml /home/Work/isoquant_27samples/dataset.yaml --complete_genedb --genedb /home/_shared/database/refdata-gex-GRCh38-2020-A/genes/genes.gtf --reference /home/_shared/database/refdata-gex-GRCh38-2020-A/fasta/genome.fa --output /home/Work/isoquant_27samples_try2 --read_group tag:CB --sqanti_output --threads 32
I then resumed the run and monitored the memory usage closely. I noticed that memory consumption increased to more than 600 GB while processing chromosome 1 (see log2), so I manually stopped the program to prevent further issues. At that time, chr1, chr7, chr11, and chrM all appeared as 'Loading read assignments' but didn't finish.
isoquant.py --resume --output /home/Work/isoquant_27samples_try2 --threads 32
After that, I resumed again with only 2 threads, hoping to reduce memory usage. At the start, chr1 and chr7 were being processed. Thirty hours after the second resume, chr1 had finished while chr7 and chr11 were still being processed. Around 6 hours later, memory usage started to increase dramatically, so I had to stop the run again (see log3). I am unsure whether I should resume again or try a different approach. Could you please advise on the next move?
isoquant.py --resume --output /home/Work/isoquant_27samples_try2 --threads 2
Furthermore, would it be helpful to:
- remove chrM from the 27 BAM files and try again?
- merge all samples' BAM files by chromosome, e.g., run chr1 from all samples together?
Thanks!
@Mangosteen24
Sorry for the delay.
I think the main issue here is --read_group tag:CB. How many barcodes do you have in total?
Although grouped counts were optimized, outputting them might still take a lot of time and possibly memory.
I will continue optimizing this step.
Removing chrM might work as well; it's known to be very slow when too many reads map onto it.
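A sketch for dropping chrM from a BAM (samtools on PATH and an indexed input are assumed, as is the "chr" naming convention):

import subprocess

# Collect all reference names except chrM from the index statistics.
idx = subprocess.run(["samtools", "idxstats", "tagged1.bam"],
                     capture_output=True, text=True, check=True)
chroms = [line.split("\t")[0] for line in idx.stdout.splitlines()
          if line.split("\t")[0] not in ("chrM", "*")]

# Write a new BAM restricted to those chromosomes and index it.
subprocess.run(["samtools", "view", "-b", "-o", "tagged1.nochrM.bam",
                "tagged1.bam", *chroms], check=True)
subprocess.run(["samtools", "index", "tagged1.nochrM.bam"], check=True)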
Best Andrey
Dear @andrewprzh, thank you so much for all the good work on recent isoquant developments!
Similarly to Mangosteen24, we've observed multiple confusing scenarios when exploring SAMPLE_ID.transcript_counts.tsv and SAMPLE_ID.transcript_model_counts.tsv entries.
Some of those can be explained by a high number of reads assigned to * or to another transcript. Some still leave us puzzled. For example:
- How come the following transcript is present only in SAMPLE_ID.transcript_counts.tsv, and not also in SAMPLE_ID.transcript_model_counts.tsv?
ENST1:
3 ambiguous
24 inconsistent
72 inconsistent_ambiguous
30 inconsistent_non_intronic
16 unique
All unique reads map to ENST1.
- Why is the following entry in SAMPLE_ID.transcript_model_counts.tsv annotated as ENST* and not as novel, when the only unique_minor_difference read maps to transcript492110.chr1.nic?
ENST3:
8 inconsistent
7 inconsistent_ambiguous
17 inconsistent_non_intronic
1 unique_minor_difference
Of note, transcript492110.chr1.nic exists as a separate entry in SAMPLE_ID.transcript_model_counts.tsv.
Thank you in advance for your help navigating the outputs! Let me know if it would be helpful to see additional files.
Best, Barbara
We ran IsoQuant like so:
Running IsoQuant version 3.6.1
isoquant.py --stranded none --complete_genedb --gene_quantification all --transcript_quantification all --splice_correction_strategy default_ont --model_construction_strategy sensitive_ont --data_type nanopore --sqanti_output --threads 25 --reference ... --genedb ... --bam ... -o ... --read_group tag:CB
Dear @bpuzek
Thank you for your feedback!
It's quite hard to answer without having the data. However, as I mentioned above, these differences are normal and expected. On simulated data, both methods exhibit fairly strong correlation with the ground truth.
Overall, I think it's pretty safe to proceed with either of them, depending on your goal (whether you need novel transcripts or not). Every tool, including IsoQuant, has room for false positives, and I doubt it is possible to eliminate all biases and discrepancies. According to various independent studies, IsoQuant is one of the best (or even the best) in terms of isoform discovery and quantification.
- In order for a transcript to appear in transcript_model_counts.tsv, it has to be reported in the output GTF. There are a lot of filters, and a high number of unique counts does not guarantee that all of them will be passed. Also, you use --gene_quantification all --transcript_quantification all, which may inflate the abundances. Any particular reason for choosing these parameters?
- Again, it's hard to figure out without looking at the data. I suspect that the inconsistent_non_intronic reads are simply ones with a different TSS/TES but an intron chain consistent with the annotation, and therefore they present sufficient evidence for this transcript to be reported; see the sketch below for a way to check.
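Something like this sketch can show what those reads look like (ENST3 stands in for the real ID; column positions are assumptions, verify against the commented header):

import pandas as pd

ra = pd.read_csv("SAMPLE_ID.read_assignments.tsv.gz", sep="\t", comment="#", header=None)
# Columns assumed: 3 = isoform_id, 5 = assignment_type, 6 = assignment_events.
sel = ra[(ra[3] == "ENST3") & (ra[5] == "inconsistent_non_intronic")]
# If these are mostly alternative TSS/TES events, the intron chains still match.
print(sel[6].value_counts())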
You may send some data to me, but due to the current load I cannot promise to look at it any time soon; the TODO list for IsoQuant is now enormous and will probably take more than a year to implement all of the planned items.
Best Andrey
Dear @andrewprzh Thanks for your reply.
I finally finished the run successfully. When interpreting the results, I found that most of the identified NIC/NNIC transcripts could not be confirmed in IGV (no obvious supporting reads in the junction/sashimi plots). Here are two examples:
The first example shows transcript166104.chr15.nic (count from transcript_model_counts.tsv: 3162). No obvious supporting reads could be identified in the junction/sashimi plot to support the alternative splicing of the last two exons. In fact, the signal over the last exon mostly comes from many short reads, highlighted in blue.
The second example shows transcript429221.chr17.nnic (count from transcript_model_counts.tsv: 12066). Again, I can't find supporting reads in the junction plot, or reads that support the linkage between the two exons. Are there any potential reasons for this, or is my interpretation incorrect?
Thanks a lot! I appreciate your help.
Dear @andrewprzh My apologies for asking so many questions. I think the tool is really useful and I am trying to make good use of it.
As I mentioned previously, I ran the program with --read_group tag:CB. Then I also tried running it without specifying tag:CB, assuming it would run in a pseudobulk way (by sample), right? I used gffcompare to compare the results from the two methods and found only limited novel transcripts in common. Is it because when specifying tag:CB each cell has limited reads, which may lead to unreliable discovery of novel transcripts?
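For reference, the gffcompare step was roughly this (a sketch; the paths are placeholders and gffcompare is assumed to be on PATH):

import subprocess

# Compare the transcript models discovered with and without tag:CB; overlap
# statistics end up in cb_vs_pseudobulk.stats.
subprocess.run(["gffcompare",
                "-r", "run_with_CB/OUT.transcript_models.gtf",
                "-o", "cb_vs_pseudobulk",
                "run_without_CB/OUT.transcript_models.gtf"], check=True)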
Another question I have is about merging BAM files by cell type for each patient. I have data from 30 patients and 10 cell types, resulting in 300 pseudobulk BAM files (by celltype and by sample). Are there any potential issues you foresee with this approach? The code I am planning to run is:
isoquant.py --data_type nanopore --yaml dataset.yaml --complete_genedb --genedb genes.gtf --reference genome.fa --output /home/isoquant_exp --sqanti_output --threads 16
dataset.yaml
[
data format: "bam",
{
name: "exp",
long read files: [
"/home/sample1_celltype1.bam",
"/home/sample1_celltype2.bam",
"/home/sample1_celltype3.bam",
"/home/sample1_celltype4.bam",
"/home/sample1_celltype5.bam",
"/home/sample1_celltype6.bam",
"/home/sample1_celltype7.bam",
"/home/sample1_celltype8.bam",
"/home/sample1_celltype9.bam",
"/home/sample1_celltype10.bam",
"/home/sample2_celltype1.bam",
"/home/sample2_celltype2.bam",
"/home/sample2_celltype3.bam",
"/home/sample2_celltype4.bam",
"/home/sample2_celltype5.bam",
"/home/sample2_celltype6.bam",
"/home/sample2_celltype7.bam",
"/home/sample2_celltype8.bam",
"/home/sample2_celltype9.bam",
"/home/sample2_celltype10.bam",
...
"/home/sample30_celltype1.bam",
"/home/sample30_celltype2.bam",
"/home/sample30_celltype3.bam",
"/home/sample30_celltype4.bam",
"/home/sample30_celltype5.bam",
"/home/sample30_celltype6.bam",
"/home/sample30_celltype7.bam",
"/home/sample30_celltype8.bam",
"/home/sample30_celltype9.bam",
"/home/sample30_celltype10.bam",
],
labels: [
"Replicate1",
"Replicate2",
"Replicate3",
"Replicate4",
"Replicate5",
"Replicate6",
"Replicate7",
"Replicate8",
"Replicate9",
"Replicate10",
"Replicate11",
"Replicate12",
"Replicate13",
"Replicate14",
"Replicate15",
"Replicate16",
"Replicate17",
"Replicate18",
"Replicate19",
"Replicate20",
...
"Replicate291",
"Replicate292",
"Replicate293",
"Replicate294",
"Replicate295",
"Replicate296",
"Replicate297",
"Replicate298",
"Replicate299",
"Replicate300"
]
}
]
Looking forward to having your comment on this. Thanks a lot!
Dear @Mangosteen24
Really sorry for such a long delay - I somehow missed a few replies on GitHub.
Regarding unsupported novel transcripts: did you check the transcript_model_reads.tsv file? It should contain the reads supporting these novel transcripts and will help you understand their origin. IsoQuant should not report completely unsupported transcripts, but if you think this is the case, I can try looking at them.
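To inspect them in IGV, you can subset the BAM to just the supporting reads, roughly like this (a sketch; the paths are placeholders, an indexed coordinate-sorted BAM is assumed, and so is the transcript_model_reads column layout, read ID then model ID):

import pandas as pd
import pysam

tmr = pd.read_csv("OUT.transcript_model_reads.tsv.gz", sep="\t", comment="#", header=None)
wanted = set(tmr[tmr[1] == "transcript166104.chr15.nic"][0])

# Copy only the supporting reads into a small BAM for IGV.
with pysam.AlignmentFile("input.bam", "rb") as src, \
     pysam.AlignmentFile("supporting.bam", "wb", template=src) as dst:
    for read in src.fetch("chr15"):    # restrict to the model's chromosome
        if read.query_name in wanted:
            dst.write(read)
pysam.index("supporting.bam")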
As I mentioned previously, I ran the program with the --read_group tag:CB. Then, I also tried running it without specifying tag:CB, assuming it would run in a pseudobulk way (by sample), right? I used gffcompare to compare the results from the two methods and found only limited novel transcripts in common. Is it because when specifying tag:CB, each cell has limited reads, which may lead to unreliable discovery of novel transcripts?
The difference is possible. When running with 2 or more files, IsoQuant requires a novel transcript to be supported by reads from at least 2 distinct files. Other than that, the algorithm is identical.
Another question I have is about merging BAM files by cell type for each patient. I have data from 30 patients and 10 cell types, resulting in 300 pseudobulk BAM files (by celltype and by sample). Are there any potential issues you foresee with this approach?
IsoQuant merges BAM files on the fly, which can be quite slow when you have 300 files. Grouped counts output can also be a bit slow (for 3.6.3 and earlier).
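If you want to avoid on-the-fly merging, one option is to merge each patient's BAMs yourself and give IsoQuant a read-to-group table instead (a sketch; the --read_group file:<path> TSV format, read ID then group ID, is an assumption based on the docs, and all paths and labels are placeholders):

import pysam

# Map each per-cell-type BAM to its group label (hypothetical names).
groups = {f"/home/sample1_celltype{c}.bam": f"sample1_celltype{c}"
          for c in range(1, 11)}

# Emit read_id <TAB> group_id for every read, for use with --read_group.
with open("read_groups.tsv", "w") as out:
    for bam, group in groups.items():
        with pysam.AlignmentFile(bam, "rb") as fh:
            for read in fh.fetch(until_eof=True):
                out.write(f"{read.query_name}\t{group}\n")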
Also, the new IsoQuant 3.7 is out. Processing a large number of groups/barcodes should be more efficient now, especially during the output step.
Again, sorry for missing your comments.
Best Andrey