BamQC icon indicating copy to clipboard operation
BamQC copied to clipboard

Insert size plot question

Open ewels opened this issue 10 years ago • 26 comments

Hello,

I have a question about what the insert size plot is calculating exactly. I ask because I get the same profile, but different numbers, with Picard CollectInsertSizeMetrics:

BamQC: image

Picard: image

These were run on the same input file. Any idea why they're different?

Phil

ewels avatar Nov 18 '15 12:11 ewels

I don't know as I didn't create this analysis. For the y axis, it seems that the tail 0s are removed. The x-axis seems not consistent though.. Thanks for pointing this out. Piero

pdp10 avatar Nov 18 '15 13:11 pdp10

Ha! That'll be a bug. I think there's a regex which removes tailing zeros from when all the plots used to be using floating point on the y axis. Guess it doesn't do the right thing when it's integers!

s-andrews avatar Nov 18 '15 13:11 s-andrews

I had a quick look at this. The class calculating that distribution is "CalculateDistribution" in uk.ac.babraham.BamQC.Utilities . I extracted this code from the module InsertLengthDistribution to make it reusable, but I did not write it. It seems that the the tail 0s are actually not removed.

pdp10 avatar Nov 18 '15 13:11 pdp10

Ok, nice. I'm not really fussed about the y axis to be honest (I'm not actually sure what those numbers refer to?), I'm more worried about the x axis. BamQC is reporting a median insert size of ~80 bp, whereas Picard comes in at around 220 bp...

ewels avatar Nov 18 '15 14:11 ewels

The class CalculateDistribution seems to work correctly to me. You can play with the BIN_SIZE parameter to have more or less values for each bar. This changes the y-axis, but certainly not as much as the plot shown using Picard. The x-axis is still the same though. The values for x and y axis labels are not altered either. They come from the distribution. In particular:

  • y-axis : this is the sum of percent values and depends on BIN_SIZE. This means that if let's say 10 counts are collected (BIN_SIZE=10), each column will be the sum of 10 percentages.
  • x-axis : This is dependent on the values extracted from read.getInferredInsertSize() . I do not see any alteration or re-scaling in the code.

What do the x and y axes indicate in the Picard plot? There is no axis label.

pdp10 avatar Nov 18 '15 14:11 pdp10

Sorry, I plotted it in excel quickly, my fault. The Picard documentation isn't super clear, but the column headers are insert_size and All_Reads.fr_count.

My understanding is that the insert size is measured from the first bp of read 1 to the first bp of read 2 (everything sequenced, the whole fragment excluding the adapters). For picard, they y axis is then the number of fragments found with that exact length.

Note that there is some confusion about this with other tools - for instance, sometimes 'insert size' is used to describe the distance in-between the two reads (eg. the RSeQC inner distance plot) - not as helpful in my mind, as dependent on sequencing length. See seqanswers posts about this here and here.

I wonder if this is what's happening in BamQC? It would explain the lower numbers. But I would expect some overlapping reads in my example above and the axis doesn't go down to 0 / negative numbers. So maybe not.

ewels avatar Nov 18 '15 14:11 ewels

Thanks for explaining me this. I think we have to go through this more carefully. Surely we need to describe this clearly in the help.

pdp10 avatar Nov 18 '15 14:11 pdp10

No problem! Yes, that would be good :books: :+1:

ewels avatar Nov 18 '15 15:11 ewels

We don't calculate that value. It's taken from the TLEN field in the BAM file, so it's whatever the read mapper puts in there. It should be the full inferred insert length - at least that's what the docs say.

s-andrews avatar Nov 18 '15 15:11 s-andrews

Which docs? ;) Presumably this could differ for various aligners? These plots were done with Bismark anyway, so presumably bowtie2. Any ideas why this reported length would be different to Picard? I'll see if I can open a downsampled version in SeqMonk to plot the same thing for extra validation.

ewels avatar Nov 18 '15 15:11 ewels

It shouldn¹t differ for different aligners - the docs are pretty specific about what it¹s supposed to mean.

I know that we did have problems when we were using this value in seqmonk as there were cases where the value reported was technically correct, but not very useful (can¹t remember the details sorry).

Simon.

s-andrews avatar Nov 18 '15 15:11 s-andrews

Ah ok. I've heard of problems before when reads are mapped FF or RR instead of FR - distance is taken from first bp of each, so misses a read in FF or RR orientation.

Anyway.. still no closer to figuring out why there's a difference in the plots?

Haven't managed to get SeqMonk to load a file yet. Would be nice if there was an option when importing BAM files to stop after [x] reads :wink:

ewels avatar Nov 18 '15 16:11 ewels

There is.

samtools view [file] | head -100000 > small.sam

..and then import that.

:-)

s-andrews avatar Nov 18 '15 16:11 s-andrews

Yup, that's pretty much what I was already running (-s 0.1 instead of head). Yours will be much faster though, I'll kill mine and try again.

ewels avatar Nov 18 '15 16:11 ewels

image

Picard wins! Thankfully, as I wanted this data to have long insert sizes.

ewels avatar Nov 18 '15 16:11 ewels

What do you get if you extract the tlen values and plot a histogram though?

s-andrews avatar Nov 18 '15 16:11 s-andrews

I need to go home! Do you have a command kicking around to extract the TLEN values before I start googling? I'll have a go tonight / tomorrow.

ewels avatar Nov 18 '15 16:11 ewels

It¹s a tab delimited file. Just do it in excel!

s-andrews avatar Nov 18 '15 16:11 s-andrews

You mean the BamQC text output? But surely that doesn't tell us anything new? I thought you meant extract the TLEN flags from the BAM file and plot them?

Data from bamqc_data.txt: image

ewels avatar Nov 18 '15 16:11 ewels

I did. SAM files are tab delimited files.

s-andrews avatar Nov 18 '15 17:11 s-andrews

Yes, yes they are.. Sorry :)

Ok, histogram of the subset of TLEN values. I used =abs() as read 2 values are negative (so double counts):

image

..so looks like BamQC is doing something funny to these numbers?

ewels avatar Nov 18 '15 21:11 ewels

Does Picard process only proper paired reads for that analysis?

In BamQC, only reads having the flags getReadPairedFlag() and getProperPairFlag() set to TRUE are parsed. Maybe this is the difference. If you suspect this could be reasonable, could you temporarily comment that test in BamQC and see whether the resulting plot is closer to the Picard plot, please?

Class: uk/ac/babraham/BamQC/Modules/InsertLengthDistribution.java Method: public void processSequence(SAMRecord read)

pdp10 avatar Nov 20 '15 14:11 pdp10

I'd be fairly surprised if this is the cause, as I would imagine that such an effect would cause BamQC to give longer insert lengths instead of shorter. I'm at a meeting now but will try to test your suggestion in a bit..

ewels avatar Nov 20 '15 15:11 ewels

It's more likely to be a screw up on our end. Let me take a look.

s-andrews avatar Nov 20 '15 15:11 s-andrews

Thanks to both of you!

pdp10 avatar Nov 20 '15 15:11 pdp10

I examined the BamQC code a bit better. The x-axis groups the lengths using a parameter BIN_SIZE. I made some test with this parameter set to 1 (meaning that now the x-axis directly reflects the value from read.getInferredInsertSize() in SAMRecord). This configuration is currently committed.

When you (@ewels) have a change, could you let me know whether the plot is closer to the picard result using the above data, please? Thanks

pdp10 avatar Jan 13 '16 13:01 pdp10