Transcript ends are supported by low coverage
Hello,
I am using stringtie for long-read ONT data and I came across an issue with transcript ends being trimmed not enough, although a drop in coverage is clearly visible.
I am running stringtie in the long-read mode with the following settings:
stringtie -L -a 10 -j 2 -p 8 -G <gencode.gtf> -o <output.gtf> <input.bam>
In the documentation I see the -t option, which can be used to disable the trimming of transcripts. So I assume this is turned on as default. However, I can't find any option to manipulate how strong this coverage based trimming is performed. An option that lets one decide how strong the coverage drop needs to be in order to trimm the transcript would be helpful.
I've attached a screenshot below of an example where really long transcripts are created even though the coverage drops from 1700 to 7 reads and is still not trimmed. I would have expected that in this case 7 reads is not enough to keep the transcript going. Thus trimming the transcript at 128 reads or 350 reads is what I was hoping to see.
Is there a way to get stringtie to behave likte this?
This issue gets even more inflated when used with stringtie --merge in a downstream workflow, since it prefers longer transcript ends when merging multiple samples.
Any help on this would be much appreciated !
Thanks you in advance !
I just wanted to second this issue, and say that this is also a common problem that I have seen extensively in short read assembly, as well as short read + long read mixed assembly. I've been working on a post-hoc approach that uses some custom scripting to do the trimming downstream of Stringtie, but a better solution implemented in StringTie would be awesome.
Hi isaacvock, it's good to see that I'm not the only one having trouble with this. How exactly do you solve the problem in your post-hoc approach? Do you sort of re-calculate the coverage per transcript, or do you trimm off ends based on a reference annotation? Also have you tried any other transcriptome assembly software (maybe FLAIR?)
The workflow looks like:
- Split the exonic and intronic regions of the assembled annotation into bins of around 100 nt (can adjust this to increase resolution of trimming)
- Use featureCounts and short read data to quantify exonic and intronic bin coverage
- Estimate the median intronic bin coverage (and regularize estimate based on global exon vs. intron coverage trends).
- NHST to determine if each exonic bin is "significantly" higher coverage than the intronic "background".
- Trim ends to the first significantly well covered exonic bin.
- Don't trim transcripts with internal exonic bins lower than intronic coverage (i.e., whose removal would introduce a new splice junction) and instead flag these as "problematic" (new boolean column added to the GTF)
I've mainly worked with short read data and only in the last couple months got into short + long read mixed assembly. I've tried Scallop but have preferred StringTie and am excited about the recent improvements made in version 3. One of my labmates working primarily with long read data has tried FLAIR, but I think they preferred StringTie just based on manual spot-checking of lots of genes. Not sure if FLAIR's end calling is better though.
One thing I haven't tried yet is providing a "point-feature" list to StringTie (via the --ptf argument), which may in theory help trim ends to annotated TES.
Best, Isaac
Hi Isaac,
sorry for the late response, I've somehow missed your reply.
Thanks for sharing the workflow. Restimating the coverage in bins makes sense to me. Maybe one could try to add a dynamic part, where the resolution of the bins gets increased (aka smaller bin sizes) towards the ends of the transcripts. I might play around with it.
Regarding FLAIR, for what I've seen I would also agree that StringTie usually works better. Now that StringTie3 is out I might compare the two to see how good they are. In general it seems to be really hard to get transcript end calling right, without the use of additional orthogonal data.
How would you use the point feature option? Just take all TES from GENCODE for instance and use them for the assembly?
Best, Mirko
Hi @MirkoBr - I am curious if you tried StringTie3 and whether it helped resolve the transcript ends (I believe it should have)?
Hi @MirkoBr,
Sorry for the delay as well. The dynamic binning strategy sounds like a great idea!
In regards to using the point features, yeah that was the simplest strategy I was thinking about, that or get some published 3'-end sequencing data in your cell line and use that to identify ends that you pass as point features. Like I said though, I haven't tried this option yet.
I've tried StringTie3 and have generally been very pleased with the improvements (namely less artifactual isoforms, like bogus gene fusions). In terms of the trimming (and this also gets to @ishinder's question), I anecdotally seen a slight improvement, at least with long + short read mix, but I may be underestimating the improvement due to the reference I am using to guide the process also being overextended relative to what we see in our particular cell line. Some examples below (tracks are from TimeLapse-seq experiments, so color is mutational content, with redder regions representing faster turnover RNA)
@isaacvock - it looks like the coverage in your igv screenshot is from short reads? Is the assembly done with hybrid mode (with short and long reads)? When using short reads, it's difficult to determine where the end of a transcript should be (because of the gradual, rather than a sharp, decline in coverage), and so it's likely that StringTie is deferring to the reference annotation. My hunch is that the StringTie3 long read mode only should resolve these ends better.
@ishinder Yes, this is short read data and the assembly is with the hybrid (--mix) mode, but I can confirm that with long read only it doesn't get any better
Interesting aside, I mentioned that the short read data is TimeLapse-seq data. What's cool about this data is that the mutational content (the redness of the short read tracks) tells us about the turnover kinetics of the RNA from which the reads were derived. In both the mix and long read only assemblies, StringTie is likely getting fooled by DoG (downstream of gene) transcription, which is rapidly turned over via (probably) Xrn2 torpedoing Pol II downstream of a CPA event. Consistent with that, the end of the high long read coverage also marks the start of high mutation content (mostly red) parts of the short read tracks. So even though you are right about the weird fall off making it hard to call ends, the mutational content in the short read data corroborates the long read ends and the fact that these assemblies all have overextended 3' UTRs:
Though this could still just be a problem of snapping to the reference ends, I am not sure if StringTie is designed to trim shorter than the reference used to guide the assembly. Maybe the situation @MirkoBr flagged to open this Issue is resolved with StringTie3
@isaacvock Thanks for sharing this interesting data, I've never looked at TimeLapse-seq data before. It's really cool to see that these long 3'UTRs are related to faster turnover. When you say that they are likely related to CPA does that mean we see a change in polyA site usage (APA)? In any case I would still expect StringTie to create two different isoforms one with the proximal and one with the distal polyA site.
The IGV shot with the long read data looks still shocking to me. The drop in coverage is really sharp. Is that surprising for you @ishinder ?
It's also an interesting thought that StringTie might be too much guided by the annotation in these cases. I'll try it in de-novo mode to see if that's the case.
Hi @MirkoBr and @isaacvock,
I believe that StringTie primarily focuses on reconstructing the correct intron chain and often collapses multiple termination sites into a single “representative” transcript that's in the reference annotation. I’m curious if, for your use-case/analysis, you need to separate these isoforms? (Perhaps it can be a future patch?)
@isaacvock – I haven’t heard about TimeLapse-seq before, but it looks very cool. I just read about it and realized that you might also benefit from running StringTie3 in nascent mode. The ‘nascent mode’ was designed to differentiate between nascent transcripts and mature transcripts in rRNA depleted (total) RNA-seq libraries. The nascent mode also works in conjunction with hybrid mode (short+long mixed RNA-seq data). To activate nascent mode, you can use the -N or --nasc flag (both flags implement the same underlying algorithm, but the --nasc will also output the assembled nascent transcripts in the GTF file. These transcripts will be labeled as “nascentRNA” in the GTF file).