htsjdk icon indicating copy to clipboard operation
htsjdk copied to clipboard

BAM async compression could use multiple threads

Open akiezun opened this issue 8 years ago • 7 comments

Right now, only 1 thread per file is used. Compression easily has enough work for multiple threads and it would speed up the runtime

akiezun avatar Apr 08 '16 13:04 akiezun

I think the real challenge here is how to control the level of thread creation, and how to handle situations where you are writing many BAM files (e.g. think of Picard's IlluminaBasecallsToSam which does demultplexing to BAM).

In an ideal world there could be a pool of threads whose size is configurable/controllable, that is used for all BAM compression and writing within a VM (or factory instance etc.). E.g. I'd love to be able to specify -Dsamjdk.bam_write_threads=8, and have it use 8 threads whether I'm writing 1 or 96 BAM files.

tfenne avatar Apr 08 '16 13:04 tfenne

agreed. A fixed-size pool of threads for all our bam needs (#482 shows that reading also benefits from multiple threads though it would probably not need more than 1 (?))

akiezun avatar Apr 08 '16 14:04 akiezun

Sounds like a singleton blocking queue of BAM blocks. I've added a similar functionality in cramtools recently. I think this should be generalized so that the same mechanism could be used for CRAM etc.

vadimzalunin avatar Apr 14 '16 13:04 vadimzalunin

Sorry to rather resurrect an ancient issue, but I see it's still open so perhaps this is better than creating a new one.

I am in the process of evaluating decode and encode speeds for various tools and so far it seems I get absolutely no gain with asynchronous I/O in htsjdk. Firstly, am I using it correct? I'm a java novice, but I have a basic test tool (below) and are using the Java -Dvar=value options to set the defaults. However it doesn't appear to validate these so whatever I type in is swallowed. How do I know if I'm setting the correct variables? Is there a verbose option somewhere?

My command is:

time java -Dhtsjdk.samtools.USE_ASYNC_IO_READ_FOR_SAMTOOLS=true -Dhtsjdk.samtools.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS=true  -cp picard.jar:. TestBamReadWrite.java ../novaseq-10m.bam  /tmp/out.bam

(Sorry, browser failure - it randomly posted my message mid writing. Will edit)

Edit: clearly not those -D options as setting COMPRESSION_LEVEL (as listed in https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/samtools/Defaults.html) has no impact. How do I find out the namespace for these configuration variables please?

import htsjdk.samtools.DefaultSAMRecordFactory;
import htsjdk.samtools.SAMFileWriter;
import htsjdk.samtools.SAMFileWriterFactory;
import htsjdk.samtools.SAMRecord;
import htsjdk.samtools.SamInputResource;
import htsjdk.samtools.SamReader;
import htsjdk.samtools.SamReaderFactory;
import htsjdk.samtools.ValidationStringency;
import htsjdk.samtools.seekablestream.SeekableStream;

import java.io.File;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;

public class TestBamReadWrite {

public static void main(String[] args) throws IOException {
    SamReader reader = SamReaderFactory
	.makeDefault()
	.validationStringency(ValidationStringency.SILENT)
	.open(new File(args[0]));        // input

    final SAMFileWriter writer = new SAMFileWriterFactory()
	.makeWriter(reader.getFileHeader(),
		    true,
		    new File(args[1]),   // output
		    new File(args[2]));  // reference

    for (SAMRecord record : reader) {
	writer.addAlignment(record);
    }

    reader.close();
    writer.close();
}

}

I realise I can add ".setUseAsyncIo(true)" in my reader and writer creation. Sadly it's pretty much negligible.

Can I confirm please that there is no true multi-threaded decompression or compression capability yet?

jkbonfield avatar May 27 '20 14:05 jkbonfield

@jkbonfield It's super confusing... the namespace is samjdk and they're all lowercase. (ex:

-Dsamjdk.use_async_io_read_samtools=true

I don't know why, it's a historical artifact and one that really needs documentation somewhere.

You can assert that the values are getting set by checking the actual value of the variable Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS in your code (or looking at the results of Defaults.allDefaults()

There is currently a single additional core available when compressing / decompressing using the async options. I've been under the impression that it had a fairly significant improvement on write speed and a fairly minor one on read.

There's a long overdue to be merged pr #1249 which offers true multithreaded compression which should help.

lbergelson avatar May 28 '20 16:05 lbergelson

Thanks for the answer and for the heads up on 1249. Great to hear something is coming along.

jkbonfield avatar May 28 '20 16:05 jkbonfield

Any update?

Crispy13 avatar Apr 13 '24 12:04 Crispy13