htsjdk
BAM async compression could use multiple threads
Right now, only one thread per file is used. Compression easily has enough work to keep multiple threads busy, and doing so would speed up the runtime.
I think the real challenge here is how to control the level of thread creation, and how to handle situations where you are writing many BAM files (e.g. think of Picard's IlluminaBasecallsToSam, which does demultiplexing to BAM).
In an ideal world there could be a pool of threads whose size is configurable/controllable, that is used for all BAM compression and writing within a VM (or factory instance etc.). E.g. I'd love to be able to specify -Dsamjdk.bam_write_threads=8, and have it use 8 threads whether I'm writing 1 or 96 BAM files.
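A minimal sketch of that idea, using only java.util.concurrent and java.util.zip: one fixed-size pool per JVM, shared by every writer no matter how many files are open. Note that the property name samjdk.bam_write_threads is hypothetical (it does not exist in htsjdk today), and all BAM/BGZF specifics are elided:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.Deflater;

public class SharedCompressionPool {
    // One pool per JVM, sized by a (hypothetical) system property,
    // shared by every writer regardless of how many files are open.
    private static final int THREADS =
            Integer.getInteger("samjdk.bam_write_threads", 1);
    private static final ExecutorService POOL =
            Executors.newFixedThreadPool(THREADS);

    // Compress one block on the shared pool; a caller preserves block
    // order by collecting the returned futures in submission order.
    public static Future<byte[]> compress(final byte[] uncompressed) {
        return POOL.submit(() -> {
            final Deflater deflater =
                    new Deflater(Deflater.DEFAULT_COMPRESSION, true); // raw deflate
            deflater.setInput(uncompressed);
            deflater.finish();
            final byte[] buf = new byte[uncompressed.length + 64];
            final int n = deflater.deflate(buf);
            deflater.end();
            final byte[] out = new byte[n];
            System.arraycopy(buf, 0, out, 0, n);
            return out;
        });
    }

    public static void main(String[] args) throws Exception {
        // Two "files" sharing the same pool.
        final Future<byte[]> a = compress(new byte[65536]);
        final Future<byte[]> b = compress("some BAM block".getBytes());
        System.out.println(a.get().length + " " + b.get().length);
        POOL.shutdown();
    }
}
```

Because the pool is process-wide, writing 1 file or 96 files draws on the same 8 threads, which is exactly the behaviour the flag above asks for.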
Agreed. A fixed-size pool of threads for all our BAM needs (#482 shows that reading also benefits from multiple threads, though it would probably not need more than one (?)).
Sounds like a singleton blocking queue of BAM blocks. I added similar functionality to cramtools recently. I think this should be generalised so that the same mechanism can be used for CRAM etc.
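That singleton-queue design can be sketched with plain JDK primitives (everything here is illustrative, not htsjdk or cramtools API, and the samjdk.compression_threads property is made up): a fixed set of daemon workers drains one shared queue that any writer, BAM or CRAM, could feed.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.zip.Deflater;

public class BlockCompressionQueue {
    // A job pairs an uncompressed block with a future for its result,
    // so each writer can emit its blocks in the original order.
    static final class Job {
        final byte[] input;
        final CompletableFuture<byte[]> result = new CompletableFuture<>();
        Job(byte[] input) { this.input = input; }
    }

    // Singleton queue shared by all writers in the VM; bounded so that
    // fast producers block instead of exhausting memory.
    private static final BlockingQueue<Job> QUEUE = new LinkedBlockingQueue<>(256);

    static {
        final int threads = Integer.getInteger("samjdk.compression_threads", 2); // hypothetical
        for (int i = 0; i < threads; i++) {
            final Thread t = new Thread(() -> {
                try {
                    while (true) {
                        final Job job = QUEUE.take();
                        job.result.complete(deflate(job.input));
                    }
                } catch (InterruptedException ignored) { /* shut down */ }
            });
            t.setDaemon(true);
            t.start();
        }
    }

    static byte[] deflate(byte[] in) {
        final Deflater d = new Deflater();
        d.setInput(in);
        d.finish();
        final byte[] buf = new byte[in.length + 64];
        final int n = d.deflate(buf);
        d.end();
        final byte[] out = new byte[n];
        System.arraycopy(buf, 0, out, 0, n);
        return out;
    }

    // Any writer hands its next block here and waits on the future.
    public static CompletableFuture<byte[]> submit(byte[] block) throws InterruptedException {
        final Job job = new Job(block);
        QUEUE.put(job);
        return job.result;
    }
}
```

The format-agnostic part is the queue and workers; only the payload (BGZF block, CRAM container slice) would differ per format.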
Sorry to resurrect an ancient issue, but I see it's still open, so perhaps this is better than creating a new one.
I am in the process of evaluating decode and encode speeds for various tools, and so far I get absolutely no gain with asynchronous I/O in htsjdk. Firstly, am I using it correctly? I'm a Java novice, but I have a basic test tool (below) and am using the Java -Dvar=value options to set the defaults. However, it doesn't appear to validate these, so whatever I type in is swallowed. How do I know if I'm setting the correct variables? Is there a verbose option somewhere?
My command is:
time java -Dhtsjdk.samtools.USE_ASYNC_IO_READ_FOR_SAMTOOLS=true -Dhtsjdk.samtools.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS=true -cp picard.jar:. TestBamReadWrite.java ../novaseq-10m.bam /tmp/out.bam
(Sorry, browser failure - it randomly posted my message mid writing. Will edit)
Edit: clearly not those -D options as setting COMPRESSION_LEVEL (as listed in https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/samtools/Defaults.html) has no impact. How do I find out the namespace for these configuration variables please?
import htsjdk.samtools.SAMFileWriter;
import htsjdk.samtools.SAMFileWriterFactory;
import htsjdk.samtools.SAMRecord;
import htsjdk.samtools.SamReader;
import htsjdk.samtools.SamReaderFactory;
import htsjdk.samtools.ValidationStringency;
import java.io.File;
import java.io.IOException;

public class TestBamReadWrite {
    public static void main(String[] args) throws IOException {
        SamReader reader = SamReaderFactory
                .makeDefault()
                .validationStringency(ValidationStringency.SILENT)
                .open(new File(args[0]));                             // input
        final SAMFileWriter writer = new SAMFileWriterFactory()
                .makeWriter(reader.getFileHeader(),
                        true,                                         // presorted
                        new File(args[1]),                            // output
                        args.length > 2 ? new File(args[2]) : null);  // reference (optional)
        for (SAMRecord record : reader) {
            writer.addAlignment(record);
        }
        reader.close();
        writer.close();
    }
}
I realise I can add .setUseAsyncIo(true) when creating my reader and writer. Sadly, the gain is pretty much negligible.
Can I confirm please that there is no true multi-threaded decompression or compression capability yet?
@jkbonfield It's super confusing... the namespace is samjdk and they're all lowercase (e.g. -Dsamjdk.use_async_io_read_samtools=true). I don't know why; it's a historical artifact and one that really needs documentation somewhere.
You can assert that the values are getting set by checking the actual value of the variable Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS in your code (or by looking at the results of Defaults.allDefaults()).
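If you'd rather not touch htsjdk classes at all, a pure-JDK sanity check works too, since htsjdk's Defaults ultimately reads these as system properties. The property name below is the one from the previous comment; the parse-with-fallback shape is an assumption about how the default is applied:

```java
public class CheckProps {
    public static void main(String[] args) {
        // Absent property -> the fallback default; present -> parsed as a boolean.
        final boolean asyncRead = Boolean.parseBoolean(
                System.getProperty("samjdk.use_async_io_read_samtools", "false"));
        System.out.println("samjdk.use_async_io_read_samtools = " + asyncRead);
    }
}
```

Running with -Dsamjdk.use_async_io_read_samtools=true should print true; a typo in the property name silently falls back to false, which is exactly the "swallowed" behaviour described above.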
There is currently a single additional core used when compressing/decompressing with the async options. I've been under the impression that it gives a fairly significant improvement on write speed and a fairly minor one on read.
There's a long-overdue-to-be-merged PR #1249 which offers true multithreaded compression, which should help.
Thanks for the answer and for the heads-up on #1249. Great to hear something is coming along.
Any update?