BamQC
BamQC copied to clipboard
Insert size plot question
Hello,
I have a question about what the insert size plot is calculating exactly. I ask because I get the same profile, but different numbers, with Picard CollectInsertSizeMetrics:
BamQC:

Picard:

These were run on the same input file. Any idea why they're different?
Phil
I don't know as I didn't create this analysis. For the y axis, it seems that the tail 0s are removed. The x-axis seems not consistent though.. Thanks for pointing this out. Piero
Ha! That'll be a bug. I think there's a regex which removes tailing zeros from when all the plots used to be using floating point on the y axis. Guess it doesn't do the right thing when it's integers!
I had a quick look at this. The class calculating that distribution is "CalculateDistribution" in uk.ac.babraham.BamQC.Utilities . I extracted this code from the module InsertLengthDistribution to make it reusable, but I did not write it. It seems that the the tail 0s are actually not removed.
Ok, nice. I'm not really fussed about the y axis to be honest (I'm not actually sure what those numbers refer to?), I'm more worried about the x axis. BamQC is reporting a median insert size of ~80 bp, whereas Picard comes in at around 220 bp...
The class CalculateDistribution seems to work correctly to me. You can play with the BIN_SIZE parameter to have more or less values for each bar. This changes the y-axis, but certainly not as much as the plot shown using Picard. The x-axis is still the same though. The values for x and y axis labels are not altered either. They come from the distribution. In particular:
- y-axis : this is the sum of percent values and depends on BIN_SIZE. This means that if let's say 10 counts are collected (BIN_SIZE=10), each column will be the sum of 10 percentages.
- x-axis : This is dependent on the values extracted from read.getInferredInsertSize() . I do not see any alteration or re-scaling in the code.
What do the x and y axes indicate in the Picard plot? There is no axis label.
Sorry, I plotted it in excel quickly, my fault. The Picard documentation isn't super clear, but the column headers are insert_size and All_Reads.fr_count.
My understanding is that the insert size is measured from the first bp of read 1 to the first bp of read 2 (everything sequenced, the whole fragment excluding the adapters). For picard, they y axis is then the number of fragments found with that exact length.
Note that there is some confusion about this with other tools - for instance, sometimes 'insert size' is used to describe the distance in-between the two reads (eg. the RSeQC inner distance plot) - not as helpful in my mind, as dependent on sequencing length. See seqanswers posts about this here and here.
I wonder if this is what's happening in BamQC? It would explain the lower numbers. But I would expect some overlapping reads in my example above and the axis doesn't go down to 0 / negative numbers. So maybe not.
Thanks for explaining me this. I think we have to go through this more carefully. Surely we need to describe this clearly in the help.
No problem! Yes, that would be good :books: :+1:
We don't calculate that value. It's taken from the TLEN field in the BAM file, so it's whatever the read mapper puts in there. It should be the full inferred insert length - at least that's what the docs say.
Which docs? ;) Presumably this could differ for various aligners? These plots were done with Bismark anyway, so presumably bowtie2. Any ideas why this reported length would be different to Picard? I'll see if I can open a downsampled version in SeqMonk to plot the same thing for extra validation.
It shouldn¹t differ for different aligners - the docs are pretty specific about what it¹s supposed to mean.
I know that we did have problems when we were using this value in seqmonk as there were cases where the value reported was technically correct, but not very useful (can¹t remember the details sorry).
Simon.
Ah ok. I've heard of problems before when reads are mapped FF or RR instead of FR - distance is taken from first bp of each, so misses a read in FF or RR orientation.
Anyway.. still no closer to figuring out why there's a difference in the plots?
Haven't managed to get SeqMonk to load a file yet. Would be nice if there was an option when importing BAM files to stop after [x] reads :wink:
There is.
samtools view [file] | head -100000 > small.sam
..and then import that.
:-)
Yup, that's pretty much what I was already running (-s 0.1 instead of head). Yours will be much faster though, I'll kill mine and try again.

Picard wins! Thankfully, as I wanted this data to have long insert sizes.
What do you get if you extract the tlen values and plot a histogram though?
I need to go home! Do you have a command kicking around to extract the TLEN values before I start googling? I'll have a go tonight / tomorrow.
It¹s a tab delimited file. Just do it in excel!
You mean the BamQC text output? But surely that doesn't tell us anything new? I thought you meant extract the TLEN flags from the BAM file and plot them?
Data from bamqc_data.txt:

I did. SAM files are tab delimited files.
Yes, yes they are.. Sorry :)
Ok, histogram of the subset of TLEN values. I used =abs() as read 2 values are negative (so double counts):

..so looks like BamQC is doing something funny to these numbers?
Does Picard process only proper paired reads for that analysis?
In BamQC, only reads having the flags getReadPairedFlag() and getProperPairFlag() set to TRUE are parsed. Maybe this is the difference. If you suspect this could be reasonable, could you temporarily comment that test in BamQC and see whether the resulting plot is closer to the Picard plot, please?
Class: uk/ac/babraham/BamQC/Modules/InsertLengthDistribution.java Method: public void processSequence(SAMRecord read)
I'd be fairly surprised if this is the cause, as I would imagine that such an effect would cause BamQC to give longer insert lengths instead of shorter. I'm at a meeting now but will try to test your suggestion in a bit..
It's more likely to be a screw up on our end. Let me take a look.
Thanks to both of you!
I examined the BamQC code a bit better. The x-axis groups the lengths using a parameter BIN_SIZE. I made some test with this parameter set to 1 (meaning that now the x-axis directly reflects the value from read.getInferredInsertSize() in SAMRecord). This configuration is currently committed.
When you (@ewels) have a change, could you let me know whether the plot is closer to the picard result using the above data, please? Thanks