
[PERF, v2] Specify Compression Level

Open fanninpm opened this issue 2 years ago • 7 comments

With Beta 7, using the built-in compression is slow when compressing to xz (using the default compression level). I'm running this in the context of a Nextstrain ncov workflow, and here is the specific rule I'm running:

My new rule, taking advantage of built-in compression:
rule align:
    message:
        """
        Aligning sequences to {input.reference}
            - gaps relative to reference are considered real
        """
    input:
        sequences = lambda wildcards: _get_path_for_input("sequences", wildcards.origin),
        genemap = config["files"]["annotation"],
        reference = config["files"]["alignment_reference"]
    output:
        alignment = "results/aligned_{origin}.fasta.xz",
        insertions = "results/insertions_{origin}.tsv",
        translations = expand("results/translations/seqs_{{origin}}.gene.{gene}.fasta.xz", gene=config.get('genes', ['S']))
    params:
        output_translations = lambda w: f"results/translations/seqs_{w.origin}.gene.{{gene}}.fasta.xz",
        strain_prefixes=config["strip_strain_prefixes"],
        sanitize_log="logs/sanitize_sequences_{origin}.txt"
    log:
        "logs/align_{origin}.txt"
    benchmark:
        "benchmarks/align_{origin}.txt"
    conda: config["conda_environment"]
    threads: 12
    resources:
        mem_mb=4500
    shell:
        """
        python3 scripts/sanitize_sequences.py \
            --sequences {input.sequences} \
            --strip-prefixes {params.strain_prefixes:q} \
            --output /dev/stdout 2> {params.sanitize_log} \
            | nextalign2 run \
            --jobs={threads} \
            --reference {input.reference} \
            --genemap {input.genemap} \
            --output-translations {params.output_translations} \
            --output-fasta {output.alignment} \
            --output-insertions {output.insertions} > {log} 2>&1
        """

With my new rule, one thread is 100% loaded (handling I/O, I assume), while the other 11 are roughly 25% loaded. This is taking more than 23 hours (with the 2022-06-20 sequence tar.xz file from GISAID), and it's still not finished. UPDATE: It finally finished, and it took 29 hours and 45 minutes.

For comparison, see the new rule in https://github.com/nextstrain/ncov/pull/963, which I have not yet tested. (If I get a chance, I will report its performance with 12 threads.)

By contrast, the old rule using Nextalign v1 took roughly 15.5 hours from start to finish (with the same sequence data from GISAID). I believe the main reason it is faster is that it uses multithreaded compression at a compression level friendlier to performance.


Do you think there could be a way to specify the compression level using a CLI argument or an environment variable? xz's default compression level of 6 seems like overkill for Nextstrain's use case, and that level hampers performance, as observed elsewhere in the Nextstrain project.

fanninpm avatar Jun 24 '22 18:06 fanninpm

@fanninpm You can always do the compression yourself if you'd like:

nextclade --output-fasta=- | xz -2 > {output.alignment}
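Adapted to your rule, that could look roughly like this (just a sketch, assuming xz's -T0 flag for multi-threaded compression; the translations would still go through the built-in compression):

    # stderr (progress/log) goes to the log file; the alignment streams to xz
    python3 scripts/sanitize_sequences.py ... \
        | nextalign2 run \
        --jobs={threads} \
        --reference {input.reference} \
        --genemap {input.genemap} \
        --output-translations {params.output_translations} \
        --output-fasta=- \
        --output-insertions {output.insertions} 2> {log} \
        | xz -2 -T0 > {output.alignment}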

How would you imagine such a flag looking? And what should happen if files with different compression formats are passed?

But in your particular experiment it is not clear whether compression is the problem here. I can see a dozen things that could cause a slowdown. CPU utilization is not a great indicator, because v2 should be much more efficient.

First of all, make sure that you are using gnu flavor of nextclade for Linux:

nextclade-x86_64-unknown-linux-gnu

and not musl flavor:

nextclade-x86_64-unknown-linux-musl

The gnu flavor is quite a bit snappier.

There were a few versions of the nextstrain/base container (and ncov-ingest, derived from it) which temporarily used the musl flavor. It performed poorly. We switched nextstrain/base back to gnu, but the ingest image has not been updated yet, AFAIK.

ivan-aksamentov avatar Jun 24 '22 19:06 ivan-aksamentov

There were a few versions of the nextstrain/base container (and ncov-ingest, derived from it) which temporarily used the musl flavor. It performed poorly. We switched nextstrain/base back to gnu, but the ingest image has not been updated yet, AFAIK.

I'm using the build-20220623T164743Z version of the container, as the previous container had Beta 2, which did not support transparent compression whatsoever.

How would you imagine such a flag looking? And what should happen if files with different compression formats are passed?

I imagine that for every --output-* flag, there could be a corresponding --compressionlevel-* flag. (The names of those flags are open to bike-shedding.)

But in your particular experiment it is not clear whether compression is the problem here. I can see a dozen things that could cause a slowdown. CPU utilization is not a great indicator, because v2 should be much more efficient.

One thing I haven't tried (that I've been itching to try) is patching the sanitize_sequences.py script to offload its decompression onto a separate process (via the xopen library).

You can always do the compression yourself if you'd like

That would be the next thing I would try (similar in spirit to nextstrain/ncov#963). Thanks for reminding me that you added the ability to pipe certain things to stdout. (As long as that's the only thing being piped to stdout — all bets are off when someone tries to pipe multiple things to stdout simultaneously. That's when I'd teach people about process substitution.)
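For multiple compressed outputs, I'm picturing something like this (a sketch only; the paths are placeholders, and it assumes nextalign is happy writing to the /dev/fd/NN paths that bash's process substitution supplies):

    # Each output stream gets its own compressor process,
    # so nothing has to share stdout.
    nextalign2 run \
        --jobs={threads} \
        --reference {input.reference} \
        --genemap {input.genemap} \
        --output-fasta >(xz -2 -T0 > results/aligned.fasta.xz) \
        --output-insertions >(xz -2 > results/insertions.tsv.xz) \
        ...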

fanninpm avatar Jun 24 '22 20:06 fanninpm

I imagine that for every --output-* flag, there could be a corresponding --compressionlevel-* flag.

That would be a lot of flags! @corneliusroemer @rneher What do you think?

@fanninpm Before we start modifying Nextclade, could you make a run of pure Nextclade (gnu version, from GitHub Releases), without python scripts, pipelines or other potential bottlenecks, and see how long it takes on your target device?

ivan-aksamentov avatar Jun 24 '22 20:06 ivan-aksamentov

@fanninpm I measured it myself on a large dataset, and indeed with xz output at compression level 6 it takes 5x the time compared to level 2.

In https://github.com/nextstrain/nextclade/pull/892 I made level 2 the default compression level for all formats and added env vars to change the levels at runtime.
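Usage will be roughly like this (the variable name below is only a placeholder for illustration; see the PR for the actual names):

    # Placeholder variable name; check the PR above for the real one
    SOME_XZ_COMPRESSION_VAR=6 nextclade run --output-fasta results/aligned.fasta.xz ...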

It will be released in 2.0.0-beta.8 in a few moments.

ivan-aksamentov avatar Jun 25 '22 01:06 ivan-aksamentov

For maximum performance I would indeed not rely on nextclade's compression. This is more for convenience when doing mid-size analyses, I'd say.

If you want fast compression, use zstd, by the way; it's much snappier than xz.
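E.g. something like this (sketch; the output path is a placeholder, and zstd's -T0 uses all cores):

    nextclade --output-fasta=- | zstd -3 -T0 > results/aligned.fasta.zst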

I wouldn't add the flags; it's a bit over the top to replicate the whole tool. Compression is purely for convenience; if you need to customize it, pipe into xz yourself.

I'll try compression speed myself, though.

corneliusroemer avatar Jun 27 '22 18:06 corneliusroemer

I tested a slightly modified version of the new rule in that PR (the only modification is piping the alignment via stdout to the xz command, instead of dumping the alignment to a ~300GB file). In total, the rule took almost 7 hours (a massive improvement from before!).

fanninpm avatar Jun 28 '22 14:06 fanninpm

I did quite an extensive comparison of xz and zstd at multiple compression levels on aligned and unaligned SC2 sequence data. zstd is better on sequence data in almost all ways. It's always at least 3x faster at decompressing. Compression time varies: at low compression levels it's much faster, and at high levels it takes a similar amount of time compared with xz.

So if you value decompression time at all, I'd use zstd.
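A rough way to reproduce that kind of comparison would be something like this (the file name is a placeholder and the levels are picked just for illustration):

    # Compare compression wall-clock time, output size, and decompression time
    time xz -6 -T0 -c sequences.fasta > sequences.fasta.xz
    time zstd -19 -T0 -c sequences.fasta > sequences.fasta.zst
    time xz -dc sequences.fasta.xz > /dev/null
    time zstd -dc sequences.fasta.zst > /dev/null
    ls -lh sequences.fasta sequences.fasta.xz sequences.fasta.zst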

Oh right, I see you use ncov - I guess we should switch to zstd in ncov, then 😄

corneliusroemer avatar Jun 30 '22 11:06 corneliusroemer