MISO
Best-guessing insert size and stddev using paired-end data versus running miso single ended
I am conducting a MISO analysis for a single gene on a large number of RNA-seq BAMs that are spread out over several external HDs (twenty 4 TB drives). Computing insert sizes for these is extremely tedious and slow, whereas MISO itself runs extremely fast for this single gene. Given that difference, calculating exact insert sizes for every sample is simply not feasible. For several samples, the mean insert size is around 150-250 and the stddev is more or less 50. Therefore, in my opinion I have two options:
- Run miso in single-end mode
- Run miso in paired-end mode with a best-guess insert size of 250 and sd of 50
I wonder if option 2) may be superior because the data is paired end. Can anyone share their thoughts? It would be extremely helpful.
There are efficient ways to calculate the insert size distribution, and many tools apart from MISO that do it (though I don't have experience with them myself). Our code for calculating the insert length distribution could be sped up considerably; it needs to be overhauled. We calculate it by looking at read pairs that land in long constitutive exons (e.g. constitutive 3' UTRs). The more constitutive exons you feed our code (in the GFF file), the slower the calculation will be. But you don't need that many constitutive exons to estimate the distribution, so a quick fix for making it run faster is to feed in fewer exons.
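A minimal sketch of the "feed in fewer exons" idea, assuming the constitutive exons are in a standard tab-separated GFF file with start and end in columns 4 and 5. The function name `subsample_const_exons` and the `min_length` and `max_exons` defaults are hypothetical and chosen only for illustration; this is not part of MISO itself.

```python
import random


def subsample_const_exons(gff_lines, min_length=1000, max_exons=200, seed=0):
    """Return header lines plus a random subset of long exon records.

    Hypothetical helper: min_length and max_exons are illustrative
    defaults, not values recommended by MISO.
    """
    header = [line for line in gff_lines if line.startswith("#")]
    long_exons = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip malformed records
        start, end = int(fields[3]), int(fields[4])
        # GFF coordinates are 1-based and inclusive
        if end - start + 1 >= min_length:
            long_exons.append(line)
    # Fixed seed so the same subset is picked on every run
    rng = random.Random(seed)
    if len(long_exons) > max_exons:
        long_exons = rng.sample(long_exons, max_exons)
    return header + long_exons
```

The reduced list can be written back out to a smaller GFF file and used in place of the full constitutive exon file when estimating the insert length distribution.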
Your standard deviation sounds very large (e.g. mean 150 and standard deviation 50). If this is accurate, it will limit the usefulness of paired-end mode.
Thanks @yarden. Do you have any impressions of how the accuracy of the psi values compares when running miso on paired-end data using
- paired end mode using accurate insert size versus
- paired end mode using estimated insert size versus
- single end mode
Wondering if it's worth investing in 1) when 2) and 3) are so much simpler.