bcftools
Faster way to process a large number of samples
I have >60,000 samples to process together and I was wondering about the fastest way to call SNPs:
a) provide a BAM file list and call SNPs?
b) merge into a single BAM file and call SNPs?
c) generate a gVCF from each sample and call SNPs together?
Any suggestion will be highly appreciated.
Option c) is problematic because indels will create ambiguities and the VCF merging step will produce very fragmented results. Option a) will create a big overhead from many files being opened simultaneously and their headers held in memory. Probably best is to create multi-sample BAMs; not necessarily all 60k samples in one, maybe 30 BAMs with 2k samples each? You'll need to run some benchmarks.
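The batching suggested above could be sketched in shell like this (the file names `all_bams.txt` and `batch_*`, the thread count, and the demo list are placeholders; the `samtools merge` commands are only echoed so the sketch can be tried without data):

```shell
# Split a master list of per-sample BAM paths into 30 batches of ~2k
# paths each, one batch per future multi-sample BAM (GNU split).
seq -f "sample%g.bam" 1 60000 > all_bams.txt   # demo list; use your own
split -d -n l/30 all_bams.txt batch_           # -> batch_00 .. batch_29

# Merge each batch into one multi-sample BAM. "echo" only prints the
# command; remove it to actually run samtools merge.
for b in batch_*; do
    echo samtools merge -@ 8 --no-PG -b "$b" "merged_${b}.bam"
done
```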
If I skip all indels and call just SNPs, will option c) be OK? With 2k samples in each BAM, which is 30 BAMs altogether, how do I run it? Do I give the list of 30 BAMs as in option a)?
Provided the reads are correctly annotated with their read group, you can pass the list of multi-sample BAMs as usual via `mpileup --bam-list list.txt`. If you don't care about indels, adding the `mpileup -I` option will speed up the calling.
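Put together, the joint-calling pipeline might look like the sketch below (`ref.fa`, `list.txt`, and `calls.vcf.gz` are placeholder names; the command is stored in a variable and printed rather than executed):

```shell
# SNP-only joint calling across all BAMs named in list.txt:
# -I skips indel calling, -f gives the reference FASTA.
cmd='bcftools mpileup -f ref.fa -I --bam-list list.txt -Ou | bcftools call -mv -Oz -o calls.vcf.gz'
echo "$cmd"
```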
I work with thousands of low-coverage samples (~5000), and I find merging into a single BAM and calling much faster to work with than calling from a list of bams.
I have found the sinto package in python useful for manipulating read groups and read tags in a single multi-sample bam file.
Also, if you want the traditional text-pileups for a multi-sample bam, then there's always streaming-pileupy.
Regarding "provided the reads are correctly annotated with their read group": I am wondering about the best practice for annotating reads. I read some Q&A and got more confused.
- I have 1000 samples multiplexed together and sequenced in a lane. Is this a better way of annotating the read group? I am thinking `ID:` should be unique so that each sample can be distinguished.
Header:
@A00740:74:H2MWHDSXY:1:1101:16514:1438 1:N:0:1
sample1:
@RG ID:H2MWHDSXY.1.sample1 SM:sample1 PL:illumina LB:Lib1 PU:H2MWHDSXY.1
sample2:
@RG ID:H2MWHDSXY.1.sample2 SM:sample2 PL:illumina LB:Lib1 PU:H2MWHDSXY.1
- Sample1 is replicated in Lib2 and sequenced in a different lane, and I want them merged together. So will this merge sample1 together (with the previous one) in the VCF?
Header:
@MG01HX01:973:H7NJNCCX2:4:1101:23378:2012 1:N:0:1
Replicated sample1:
@RG ID:H7NJNCCX2.4.sample1 SM:sample1 PL:illumina LB:Lib2 PU:H7NJNCCX2.4
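If the read groups are not already in the BAMs, one way to add them is `samtools addreplacerg`. A sketch using the fields from the example above (`sample1.bam` and `sample1.rg.bam` are placeholder names; the command is only printed here, not run):

```shell
# Attach the @RG record for sample1, one -r option per field.
cmd="samtools addreplacerg -r 'ID:H2MWHDSXY.1.sample1' -r 'SM:sample1' -r 'PL:illumina' -r 'LB:Lib1' -r 'PU:H2MWHDSXY.1' -o sample1.rg.bam sample1.bam"
echo "$cmd"
```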
Hi @sandeshsth, I find read groups confusing too. The tools I have encountered generally use the `SM` tag to refer to biological samples and the `ID` tag to distinguish reads coming from different lanes (but perhaps still from the same biological sample). I find this post helpful in understanding the difference: https://gatkforums.broadinstitute.org/gatk/discussion/6472/read-groups
Also, why does the line `ID:H2MWHDSXY.1.sample2 SM:sample1` in your first example contain both the string `sample2` and `sample1`?
Hi @winni2k, thanks for the response. It should be `ID:H2MWHDSXY.1.sample2 SM:sample2`. I will change it.
Actually, I was reading the same thread and Q&A. It looks like GATK uses `ID` to identify the reads of a particular sample and assigns them to `SM`. There is an answer from an admin:
"For your definition, the ID fields are identical for your tumor and normal samples and when the files are merged for joint analysis, our tools are then unable to distinguish reads from the two different files."
In your case, it looks like `SM` will identify individual samples even if `ID` is the same, such as `ID:H2MWHDSXY.1` for all multiplexed samples in a lane. I am wondering how reads are differentiated, as `ID` is only present as `RG:Z:H2MWHDSXY.1`?
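As I understand it, callers follow each read's `RG:Z` tag to the matching `@RG` header line and take its `SM` value, which is why `ID` must be unique per `@RG` line while samples are grouped by `SM`. A quick way to inspect that mapping in a merged BAM (`merged.bam` is a placeholder; the commands are only printed here):

```shell
# Show the ID -> SM mapping in the header, and the distinct per-read
# RG:Z tags, to check that reads resolve to the intended samples.
hdr_cmd="samtools view -H merged.bam | grep '^@RG'"
tag_cmd="samtools view merged.bam | grep -o 'RG:Z:[^[:space:]]*' | sort -u"
echo "$hdr_cmd"
echo "$tag_cmd"
```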
I used bcftools to call SNPs in 4 samples:
- input individual BAMs as a list: 49 min 46 sec
- 4 BAMs merged into a single file with samtools, then SNP calling: 48 min 15 sec
I was wondering why the speed is not much different between the merged BAM and the list of BAMs. Is there anything that I missed?
A small gain in time looks reasonable to me with only 4 BAMs. Speed improvements probably won't become visible until you are using hundreds of BAMs or more.
I tried with 2,474 BAMs:
- bcftools with the BAM list took 5 hrs 28 min
- `samtools merge` to get a single BAM took 1 day and 6 hours, plus bcftools took 2 hrs 43 min
It shows that running on a single BAM is faster, but generating the merged BAM takes a lot of time. Is there any way to speed up the merging step?
"samtools merge to get a single bam took 1 day and six hours"
Which version of samtools are you running, and how large (number of header lines, number of records) are the BAM files?
"It shows that running a single bam is faster but to generate merged bam is taking a lot of time. Is there any way to speed the merging step?"
In principle, yes: one can run multiple merge commands in parallel on subsets of BAMs. I have seen the recommendation to merge 500 BAMs at a time, and then to merge the resulting merged BAMs. With 2,474 BAMs that should be five sets of around 500. This is a bit of a hassle to implement in a pipeline, though, and merging BAMs is something I only do once or twice, so a wait time of a day has been fine for me and I have never tried it.
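The two-level merge described above could be sketched like this (the demo list, file names, and thread counts are placeholders; the actual `samtools merge` commands are only echoed so the sketch runs without data):

```shell
# Split 2,474 BAM paths into chunks of 500 (the last chunk gets the
# remainder), merge chunks in parallel, then merge the intermediates.
seq -f "s%g.bam" 1 2474 > bams.txt      # demo list; use your own
split -d -l 500 bams.txt chunk_         # -> chunk_00 .. chunk_04

for c in chunk_*; do
    echo samtools merge -@ 4 --no-PG -b "$c" "merged_${c}.bam" "&"
done
echo wait                               # wait for the parallel merges
echo samtools merge -@ 4 --no-PG merged_all.bam merged_chunk_*.bam
```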
@valeriuo
Command used (samtools version 1.10):
~/bin/samtools_1.10/bin/samtools merge --no-PG -b 2474_bams_list.txt merged.s.bam
There are 28 header lines and ~2 million records in each file. The average file size is 172 MB; the total size of the 2,474 files is 426 GB, and the single merged file is 289 GB.
I thought `--no-PG` would avoid `@PG` lines in the merged file, but they are still there; I am not sure why it is not working.
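For what it's worth, my understanding is that `--no-PG` only stops `samtools merge` from appending its own `@PG` record; `@PG` lines already present in the input headers are carried through. If those need to go too, one common idiom is to rewrite the header with `samtools reheader` (`merged.bam` and `merged_nopg.bam` are placeholders; the command is only printed here):

```shell
# Drop all @PG lines from merged.bam's header, keeping everything else.
cmd="samtools view -H merged.bam | grep -v '^@PG' | samtools reheader - merged.bam > merged_nopg.bam"
echo "$cmd"
```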
@winni2k Actually, I have ~70,000 samples to process, so I am trying to find the fastest way possible. Next, I will get ~10k more samples to process and combine with the ~70k files, and I will have around a week to process that second batch.