noodles icon indicating copy to clipboard operation
noodles copied to clipboard

noodles implementation is slower than rust-htslib, did I do something suboptimal?

Open bwlang opened this issue 7 months ago • 4 comments

rust-htslib:

cat   0.00s user 0.01s system 1% cpu 0.568 total
sudo cargo run --release -- > tandem_reads.bam  0.43s user 0.04s system 82% cpu 0.570 total

noodles:

cat   0.00s user 0.01s system 0% cpu 1.192 total
sudo cargo run --release -- > tandem_reads.bam  1.04s user 0.05s system 91% cpu 1.195 total

i first implemented with htslib, then tried switching to noodles with no architectural changes. image

for noodles i'm opening like this:

    let mut bam_reader = bam::reader::Builder.build_from_path(&opt.input_bam_file)?;
    let header = bam_reader.read_header()?;
    let mut bam_writer = bam::writer::Builder.build_from_path(&opt.bam_output_file)?;
    for r in bam_reader.records(&header) {
        ...
    }

for htlslib like this:

    let mut bam = bam::Reader::from_path(&opt.input_bam_file).unwrap();
    let header = bam::Header::from_template(bam.header());
    let mut bam_out = bam::Writer::from_path(&opt.bam_output_file, &header, bam::Format::Bam).unwrap();
    for r in bam.records() {
        ...
    }

The internal filtering logic is the same , reasoning about flags insert size , etc - writing out some records.

Any suggestions?

bwlang avatar Nov 26 '23 02:11 bwlang

this is on a macbook arm64 computer.

bwlang avatar Nov 26 '23 12:11 bwlang

noodles tends to do more record validation and tries to ensure data is spec compliant. However, in your flamegraphs, it seems the DEFLATE implementation htslib linked to is using SIMD intrinsics (see inflate_fast_neon and longest_match_neon) on your CPU. By default, noodles uses the pure Rust implementation miniz_oxide, which can't compete in this case.

Can you retime your tests using libdeflate for both? I.e., cargo add noodles-bgzf --features libdeflate for noodles and cargo add rust-htslib --features libdeflate for rust-htslib. libdeflate is much faster for this application, regardless, and it would be better than comparing differing DEFLATE implementations.

zaeleus avatar Nov 26 '23 18:11 zaeleus

with htslib libdeflate:

cat   0.00s user 0.01s system 1% cpu 0.574 total
sudo cargo run --release -- > tandem_reads.bam  0.45s user 0.03s system 83% cpu 0.577 total

with noodles

cat   0.00s user 0.01s system 1% cpu 0.740 total
sudo cargo run --release -- > tandem_reads.bam  0.61s user 0.05s system 89% cpu 0.743 total
image

much closer... thanks for the suggestion! Anything else I should try?

bwlang avatar Nov 26 '23 19:11 bwlang

If you don't need owned records, you can reuse the record buffer.

let mut record = sam::alignment::Record::default();

while bam_reader.read_record(&header, &mut record)? != 0 {
    // ...
}

When doing this, you may also want to use Reader::rc_records in your rust-htslib test for a more fair comparison.

zaeleus avatar Nov 26 '23 19:11 zaeleus

noodles 0.61.0 now allows alignment format records to read and written, but I'm not sure if this will help in a passthrough case like in your tests. Feel free to open a new discussion if you have any further questions.

zaeleus avatar Jan 25 '24 17:01 zaeleus