Minimac I/O Bounds
We are using minimac4 to impute >300k samples and are finding that, on our compute resources, the runtime of minimac4 is dominated almost entirely by the writing of temporary VCF files and the append step at the end. We are trying to find a way to mitigate this performance bottleneck. We have attempted the following strategies:

- We increased --vcfBuffer to 2000. This reduced our time spent in I/O by about 30%, but doubled the memory consumption.
- We found that output during the final "append" step was CPU-bound in our environment: file I/O appears to be single-threaded, and the CPU sat at 100% while disk I/O was light. We tried the --nobgzip option and then used the bgzip tool to perform the final compression. However, we found that --nobgzip only wrote the final output in an uncompressed format; the runtime remained CPU-bound by the single-threaded (de)compression of the temporary VCF files, and the total time spent in I/O was essentially unaffected by the change. bgzip did, however, compress the final file in just a few minutes.
- We made a small local change to this region of the Minimac code to write out the temporary files uncompressed. This changed the problem from being CPU-bound to I/O-bound and seemed to save us ~2k seconds according to the "Imputation successful !!!" line. However, the append step did not succeed due to an out-of-memory error; it appears Minimac attempts to read in all of the temporary VCF files and runs out of memory before writing any data.
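For what it's worth, recent bgzip (from htslib) can compress with multiple threads via its -@ option, which is likely why the external compression step finished in minutes. A sketch of that workflow, with placeholder file names (out.dose.vcf stands in for whatever --prefix produces):

```shell
# Run imputation without bgzip-compressing the final output,
# then compress it externally with multiple threads.
minimac4 --refHaps ref.m3vcf.gz --haps target.vcf.gz \
         --prefix out --nobgzip

bgzip -@ 8 out.dose.vcf        # multi-threaded bgzip (htslib >= 1.4)
tabix -p vcf out.dose.vcf.gz   # index the compressed result
```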
Do you have any suggestions for how to improve the I/O performance of minimac4? Is it possible to do any parallel I/O?
See point 4 in https://github.com/statgen/libStatGen/issues/20#issuecomment-425575055.
I've continued working on trying to eke out additional parallelization from Minimac. For this code region, I think it is safe to split this into two for loops, with the read made OMP-parallel and the write kept as a separate single-threaded loop. Like so:
vector<string> lines(MaxIndex);

// Read from the temporary VCF files in parallel; each iteration writes
// to a distinct element of `lines`, so no synchronization is needed.
#pragma omp parallel for
for (int j = 1; j <= MaxIndex; j++)
{
    lines[j - 1].clear();
    vcfdosepartialList[j - 1]->readLine(lines[j - 1]);
}

// Append the lines single-threaded to preserve chunk order.
for (int j = 1; j <= MaxIndex; j++)
{
    VcfPrintStringLength += sprintf(VcfPrintStringPointer + VcfPrintStringLength,
                                    "%s", lines[j - 1].c_str());
}