libStatGen

Parallel BGZF

Open daheise opened this issue 7 years ago • 6 comments

Hello,

Our team has been using Minimac4 for imputation work. We have found that it uses the InputFile class from libStatGen for BGZF I/O.

It seems that libStatGen would benefit from updating its samtools and/or htslib code to versions that support parallel BGZF I/O. Has any work or consideration been given to this feature enhancement?

daheise • Sep 25 '18 13:09

That would be a great addition. Would you consider contributing it?

G

abecasis • Sep 26 '18 04:09

I have started a work-in-progress (WIP) branch in my fork here. However, I may need help with changes to the libStatGen build process; I am rusty on CXX build tools. My WIP assumes that the samtools subdirectory contains samtools 1.9, already built, and samtools in turn assumes an htslib 1.9 subdirectory, but I have not included those changes in my WIP branch at this time.

A few challenges are the following.

  1. libStatGen currently includes a snapshot of samtools from some point in time. I don't know if you want to simply embed a newer version of samtools (v1.9 is assumed in my WIP) and the corresponding htslib within this repository, or use some other strategy such as submodules.

  2. Version 1.9 of samtools/htslib separates out some functionality that used to be all in samtools. In other words, samtools depends on htslib, and libStatGen will now depend on both samtools and htslib. If libStatGen is meant to be a fully self-contained archive, the way the archive is built has to change to account for the new directory layout that comes with the upgrade to samtools/htslib 1.9.

  3. Minimac4 is the program we are targeting. As I have it working on my development machine, I have to link libStatGen, samtools, and htslib to Minimac4 in order to resolve all the symbols used in libStatGen.

daheise • Sep 26 '18 14:09

Regarding the strategy for handling the dependencies on samtools and htslib: my revision as it currently stands treats them as external dependencies rather than embedding them in libStatGen. I did run into one issue, which is that samtools does not install its libraries and header files in system locations. When I brought this up as a possible change to samtools, the samtools team mentioned that samtools is largely deprecated in favor of htslib.

I found only one use of libbam from samtools: the function bam_reg2bin. As seen here, the samtools team suggests a similar call from htslib. Unfortunately, I do not have enough expertise in samtools or libStatGen to know whether this substitution may have unexpected side effects. Are you able to provide any advice or testing?
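
One way to test for side effects is to compare the two bin calculations directly. The sketch below is self-contained and does not link against samtools or htslib: `reg2bin_fixed` follows the five-level binning formula published in the SAM specification, and `reg2bin_general` is a parameterized form of the same scheme. That the htslib replacement is `hts_reg2bin` called with a minimum shift of 14 and 5 levels is an assumption here, and all function names in the sketch are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Fixed five-level binning from the SAM specification.
// beg is 0-based inclusive, end is 0-based exclusive.
int reg2bin_fixed(std::int64_t beg, std::int64_t end) {
    assert(beg >= 0 && end > beg);
    --end;
    if (beg >> 14 == end >> 14) return ((1 << 15) - 1) / 7 + static_cast<int>(beg >> 14);
    if (beg >> 17 == end >> 17) return ((1 << 12) - 1) / 7 + static_cast<int>(beg >> 17);
    if (beg >> 20 == end >> 20) return ((1 << 9) - 1) / 7 + static_cast<int>(beg >> 20);
    if (beg >> 23 == end >> 23) return ((1 << 6) - 1) / 7 + static_cast<int>(beg >> 23);
    if (beg >> 26 == end >> 26) return ((1 << 3) - 1) / 7 + static_cast<int>(beg >> 26);
    return 0;
}

// Parameterized form of the same scheme (assumed to mirror the htslib call).
// min_shift is the log2 size of the smallest bin, n_lvls the number of levels
// below the root bin.
int reg2bin_general(std::int64_t beg, std::int64_t end, int min_shift, int n_lvls) {
    assert(beg >= 0 && end > beg);
    int s = min_shift;
    int t = ((1 << (3 * n_lvls)) - 1) / 7;  // offset of the first bin on the finest level
    --end;
    // Walk from the finest level to the coarsest; after --l, t moves to the
    // first bin of the next coarser level.
    for (int l = n_lvls; l > 0; --l, s += 3, t -= 1 << (3 * l)) {
        if (beg >> s == end >> s) return t + static_cast<int>(beg >> s);
    }
    return 0;
}

// True when both forms assign the region to the same bin.
bool bins_agree(std::int64_t beg, std::int64_t end) {
    return reg2bin_fixed(beg, end) == reg2bin_general(beg, end, 14, 5);
}
```

Running `bins_agree` over region boundaries near multiples of 2^14, 2^17, and so on would be the cases most likely to expose a difference.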

daheise • Sep 28 '18 19:09

Some thoughts:

  1. I doubt this proposed update is going to help much with Minimac4. In my experience, the biggest IO bottleneck in Minimac4 is writing temp files, not reading input. This is mostly because the temp files are VCF (text based) and involve a lot of printf statements. Writing out compressed binary hap dosages to temp files would be much faster.

  2. If you really want to do this, there is an experimental libStatGen branch called cram-support that already leverages htslib and addresses the duplicate symbol errors that occur.

  3. A more forward-looking IO port for Minimac4 would be to replace libStatGen with Savvy (https://github.com/statgen/savvy).

  4. You can get better performance out of Minimac4 with large sample sizes without modifying source. The solution is to change your execution design: decrease the number of threads you are using and increase the number of processes. In other words, partition the input files into smaller groups of samples and run the sample groups in parallel. Then merge the output files with bcftools.
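
As a rough illustration of the partitioning in item 4, the sketch below splits a sample count into contiguous, near-equal index ranges, one per process. The function name and the choice of contiguous ranges are assumptions made for illustration; how each range is extracted into its own input file and how the outputs are merged is left to the tools mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Split n_samples into at most n_groups contiguous [begin, end) index ranges
// whose sizes differ by at most one. Hypothetical helper, not part of
// Minimac4 or libStatGen.
std::vector<std::pair<std::size_t, std::size_t>>
partition_samples(std::size_t n_samples, std::size_t n_groups) {
    assert(n_groups > 0);
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    const std::size_t base = n_samples / n_groups;   // minimum group size
    const std::size_t extra = n_samples % n_groups;  // first `extra` groups get one more
    std::size_t begin = 0;
    for (std::size_t g = 0; g < n_groups; ++g) {
        const std::size_t size = base + (g < extra ? 1 : 0);
        if (size == 0) break;  // more groups requested than samples available
        ranges.emplace_back(begin, begin + size);
        begin += size;
    }
    return ranges;
}
```

For example, `partition_samples(10, 3)` yields the ranges [0, 4), [4, 7), and [7, 10), each of which would be imputed by a separate process.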

jonathonl • Sep 28 '18 21:09

I've been doing some work and experiments before coming back to this.

(1) We have noticed the same. Minimac4 does spend a fair amount of time doing in-memory, single-threaded printfs. However, we do get a non-trivial speedup from multithreaded BGZF writes to disk, especially with very large settings of printBuffer (2 GB).

(2) Thank you for pointing out this branch. I will check it out. My work so far has only focused on the bgzf portion of htslib, not the other changes it brings to the table.

(4) That was our first strategy. We have pushed it to the limit for our use case and are looking for more.
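
For context on why BGZF writes can be multithreaded at all: a BGZF file is a sequence of independently compressed blocks, so blocks can be compressed concurrently as long as they are written out in their original order. The sketch below shows only that structure. It is not htslib code: `compress_block` is a hypothetical run-length stand-in for the real per-block deflate step, the 0xff00 block size is what htslib is believed to use, and launching one task per block is a simplification where a real implementation would use a fixed-size thread pool.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <string>
#include <vector>

// Uncompressed bytes per block (assumed to match htslib's BGZF block size).
constexpr std::size_t kBlockSize = 0xff00;

// Hypothetical stand-in for per-block deflate: emits (run length, byte) pairs.
std::string compress_block(const std::string& block) {
    std::string out;
    for (std::size_t i = 0; i < block.size();) {
        std::size_t run = 1;
        while (i + run < block.size() && block[i + run] == block[i] && run < 255) ++run;
        out.push_back(static_cast<char>(run));
        out.push_back(block[i]);
        i += run;
    }
    return out;
}

// Reference path: compress the blocks one after another.
std::string compress_serial(const std::string& data, std::size_t block_size = kBlockSize) {
    assert(block_size > 0);
    std::string out;
    for (std::size_t off = 0; off < data.size(); off += block_size)
        out += compress_block(data.substr(off, block_size));
    return out;
}

// Parallel path: one asynchronous task per block, results joined in input
// order so the output is byte-identical to the serial path.
std::string compress_parallel(const std::string& data, std::size_t block_size = kBlockSize) {
    assert(block_size > 0);
    std::vector<std::future<std::string>> tasks;
    for (std::size_t off = 0; off < data.size(); off += block_size)
        tasks.push_back(std::async(std::launch::async, compress_block,
                                   data.substr(off, block_size)));
    std::string out;
    for (auto& task : tasks) out += task.get();
    return out;
}
```

Because each block depends only on its own bytes, the formatting step (the printfs) stays serial while the compression cost is spread across threads, which is consistent with larger buffers giving the workers more blocks to process at once.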

daheise • Oct 09 '18 19:10

@abecasis I have submitted a PR for this feature here: #21

daheise • Oct 12 '18 17:10