bcftools icon indicating copy to clipboard operation
bcftools copied to clipboard

Faster way to process large number of samples.

Open sandeshsth opened this issue 4 years ago • 14 comments

I do have >60,000 samples to process together. I was wondering about the fastest way to process: a) Provide a bam file list to call SNPs? b) Merge into a single bam file and call SNPs c) generate gVCF from each sample and call SNPs together? Any suggestion will be highly appreciated.

sandeshsth avatar Mar 17 '20 18:03 sandeshsth

The option c) is problematic because indels will create ambiguities and the VCF merging step will produce very fragmented results. Option a) will create a big overhead from many files opened simulatenously and their headers in memory. Probably best is to create a multi-sample bam. Not necessarily all 60k samples in one, maybe 30 bams with 2k samples? You'll need to run some benchmarks.

pd3 avatar Mar 23 '20 15:03 pd3

If I skip all indels, just SNPs, the option c will be ok? With 2k samples in each bam which is 30 bams all together, how do I run? Give the list of 30 bams as in option a?

sandeshsth avatar Mar 23 '20 17:03 sandeshsth

Provided the reads are correctly annotated with their readgroup, you can pass the list of multi-sample BAMs as usual via mpileup --bam-list list.txt. If you don't care about indels, adding the mpileup -I option will speed up the calling.

pd3 avatar Mar 24 '20 09:03 pd3

I work with thousands of low-coverage samples (~5000), and I find merging into a single BAM and calling much faster to work with than calling from a list of bams.

I have found the sinto package in python useful for manipulating read groups and read tags in a single multi-sample bam file.

winni2k avatar Jul 03 '20 07:07 winni2k

Also, if you want the traditional text-pileups for a multi-sample bam, then there's always streaming-pileupy.

winni2k avatar Jul 03 '20 08:07 winni2k

"Provided the reads are correctly annotated with their readgroup": I am wondering about the best practice of annotating reads. I read some Q&A and got more confused.

  1. I have 1000 samples multiplexed together and sequenced in a lane. So, is this a better way of annotating read group ID? I am thinking ID: should be unique so that each sample can be distinguished.
Header:
@A00740:74:H2MWHDSXY:1:1101:16514:1438 1:N:0:1

sample1
@RG     ID:H2MWHDSXY.1.sample1 SM:sample1   PL:illumina     LB:Lib1      PU:H2MWHDSXY.1

sample2
@RG     ID:H2MWHDSXY.1.sample2 SM:sample2   PL:illumina     LB:Lib1      PU:H2MWHDSXY.1
  1. The sample1 is replicated in lib2 and sequenced in a different lane. I want them merged together. So this will merge sample1 together (with previous one) in the VCF?
Header:
@MG01HX01:973:H7NJNCCX2:4:1101:23378:2012 1:N:0:1

Replicated sample1 
@RG     ID:H7NJNCCX2.4.sample1 SM:sample1   PL:illumina     LB:Lib2      PU:H7NJNCCX2.4

sandeshsth avatar Jul 16 '20 16:07 sandeshsth

Hi @sandeshsth, I find read groups confusing too. The tools I have encountered generally use the SM tag to refer to biological samples and the ID tag to distinguish reads coming from different lanes (but perhaps still for the same biological sample). I find this post helpful in understanding the difference: https://gatkforums.broadinstitute.org/gatk/discussion/6472/read-groups

Also, why does the ID ID:H2MWHDSXY.1.sample2 SM:sample1 in your first example contain both the string sample2 and sample1?

winni2k avatar Jul 16 '20 16:07 winni2k

Hi @winni2k, Thanks for the response. Should be ID:H2MWHDSXY.1.sample2 SM:sample2. I will change it.

Actually, I was reading the same thread and Q&A.

It looks like GATK uses ID to identify reads of a particular sample and assign it to SM. There is an answer from Admin: "For your definition, the ID fields are identical for your tumor and normal samples and when the files are merged for joint analysis, our tools are then unable to distinguish reads from the two different files."

In your case, it looks like SM will identify individual samples even if ID is same such as ID:H2MWHDSXY.1 for all multiplexed samples in a lane. I am wondering how reads are differentiated as ID is present as RG:Z:H2MWHDSXY.1?

sandeshsth avatar Jul 16 '20 17:07 sandeshsth

I used Bcftools to call SNPs in 4 samples:

  1. input individual bams as a list took 49 min and 46 sec.
  2. 4 bams were merged into a single file with Samtools and called SNPs, which took 48 min 15 sec.

I was wondering why the speed is not much different between merged vs list of bams. Is there anythig that I missed?

sandeshsth avatar Jul 17 '20 01:07 sandeshsth

I used Bcftools to call SNPs in 4 samples:

1. input individual bams as a list took 49 min and 46 sec.

2. 4 bams were merged into a single file with Samtools and called SNPs, which took 48 min 15 sec.

I was wondering why the speed is not much different between merged vs list of bams. Is there anythig that I missed?

A small gain in time looks reasonable to me with only 4 bams. Speed improvements aren't visible until you are using perhaps hundreds of bams or more.

winni2k avatar Jul 17 '20 08:07 winni2k

I tried with 2,474 Bams:

  1. Bcftools with Bam list took 5 hrs 28 min

  2. samtools merge to get a single bam took 1 day and six hours + Bcftools took 2 hrs 43 min

It shows that running a single bam is faster but to generate merged bam is taking a lot of time. Is there any way to speed the merging step?

sandeshsth avatar Jul 21 '20 02:07 sandeshsth

samtools merge to get a single bam took 1 day and six hours

Which version of samtools are you running and how large (number of header lines, number of records) are the BAM files?

valeriuo avatar Jul 21 '20 06:07 valeriuo

It shows that running a single bam is faster but to generate merged bam is taking a lot of time. Is there any way to speed the merging step?

In principle, yes: One can run multiple merge commands in parallel on subsets of bams. I have seen the recommendation to merge 500 bams at a time, and then to merge the resulting merged bams. With 2,474 bams that should be five sets of around 500 bams. This is a bit of a hastle to implement in a pipeline though, and merging of bams is something I only do once or twice, so a wait time of a day has been fine with me and I have never tried it.

winni2k avatar Jul 21 '20 07:07 winni2k

@valeriuo Command used (samtools version 1.10): ~/bin/samtools_1.10/bin/samtools merge --no-PG -b 2474_bams_list.txt merged.s.bam

There are 28 headers and ~2 million records in each file. The average file size is 172 MB. The total size of 2,474 files is 426 GB. A single merged file is 289 GB. I thought --no-PG will avoid @PG in the merged file but they are still there, I am not sure why it is not working.

@winni2k Actually, I have ~70000 samples to process. So, I am trying to find the fastest way possible. The next, I will get ~10k more to process and combine with ~70k file. I will have around a week to process the second part.

sandeshsth avatar Jul 21 '20 14:07 sandeshsth