
Best-guessing insert size and stddev using paired-end data versus running miso single ended

Open fpbarthel opened this issue 7 years ago • 2 comments

I am conducting a miso analysis for a single gene on a large number of RNA-seq BAMs that are spread out over several external HDs (twenty 4 TB drives). Computing insert sizes for these is extremely tedious and slow, whereas miso itself runs extremely fast for this single gene. Given how slow the insert size step is, calculating exact insert sizes is simply not feasible. For several samples, the mean insert size is around 150-250 and the stddev is more or less 50. Therefore, in my opinion I have two options:

  1. Run miso in single-end mode
  2. Run miso in paired-end mode with a best-guess insert size of 250 and sd of 50

I wonder if option 2) may be superior because the data is paired-end. Can anyone share their thoughts? It would be extremely helpful.

fpbarthel avatar Aug 23 '16 21:08 fpbarthel

There are efficient ways to calculate the insert size distribution, and many tools other than MISO do it (though I don't have experience with them myself). Our code for calculating the insert length distribution could be sped up considerably; it needs to be overhauled. We calculate it by looking at read pairs that land in long constitutive exons (e.g. exons that constitute 3' UTRs). The more constitutive exons you feed our code (in the GFF file), the slower the calculation will be. But you don't need that many constitutive exons to estimate the distribution, so as a quick fix you can feed in fewer exons to make it run faster.
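
As a rough illustration of the "sample a subset and estimate" idea, here is a minimal Python sketch. It is not MISO's own method: instead of restricting to constitutive exons from a GFF, it assumes SAM-formatted text lines as input (for example the output of `samtools view` on a small region), reads the TLEN column, and skips pairs whose leftmost mate has a spliced alignment. The function name `estimate_insert_size` and the `max_pairs` and `max_tlen` cutoffs are hypothetical choices for this example.

```python
import math


def estimate_insert_size(sam_lines, max_pairs=100_000, max_tlen=1000):
    """Estimate (mean, stddev, n) of template length from SAM text lines.

    Assumptions of this sketch (not taken from MISO):
      - input is plain SAM text, one alignment per line
      - TLEN (column 9) is a usable proxy for insert length
      - pairs longer than max_tlen likely span an intron and are dropped
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's algorithm)
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # malformed or truncated line
        flag = int(fields[1])
        cigar = fields[5]
        tlen = int(fields[8])
        # Keep properly paired reads (0x2); drop secondary (0x100)
        # and supplementary (0x800) alignments.
        if not flag & 0x2 or flag & 0x900:
            continue
        # The leftmost mate carries the positive TLEN, so this counts
        # each pair once. Also drop implausibly long templates.
        if tlen <= 0 or tlen > max_tlen:
            continue
        # Skip spliced alignments ('N' in CIGAR). Only this mate's CIGAR
        # is checked; the other mate may still be spliced.
        if "N" in cigar:
            continue
        n += 1
        delta = tlen - mean
        mean += delta / n
        m2 += delta * (tlen - mean)
        if n >= max_pairs:
            break  # a subsample is enough for a stable estimate
    if n < 2:
        raise ValueError("need at least two usable read pairs")
    return mean, math.sqrt(m2 / (n - 1)), n
```

One way to use it would be to pipe `samtools view` output for a handful of long exons into a script that calls `estimate_insert_size(sys.stdin)`. Because the loop stops after `max_pairs` usable pairs, the whole BAM never has to be read.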

Your standard deviation sounds very large relative to the mean (e.g. mean 150 and standard deviation 50). If this is accurate, it'll limit the usefulness of paired-end mode.

yarden avatar Aug 24 '16 12:08 yarden

Thanks @yarden. Do you have any impression of how the accuracy of the psi values compares when running miso on paired-end data using

  1. paired end mode using accurate insert size versus
  2. paired end mode using estimated insert size versus
  3. single end mode

Wondering if it's worth investing in 1) when 2) and 3) are so much simpler.

fpbarthel avatar Aug 24 '16 13:08 fpbarthel