
ENH(curate): Support zstd compressed fasta, metadata and/or ndjson

Open corneliusroemer opened this issue 1 year ago • 2 comments

Context

Just trying out augur curate. The first pain point is that it doesn't seem to support zstd-compressed FASTA (at least according to the docs).

Description

Support zstd-compressed input metadata and FASTA, and maybe also NDJSON, to reduce storage requirements.

Any thoughts @joverlee521?

corneliusroemer avatar May 16 '23 12:05 corneliusroemer

I'm getting an error (which, incidentally, should also be handled better):

$ augur curate passthru --fasta data/gisaid.fasta.zst --metadata data/gisaid_metadata.tsv --seq-field seqfield --seq-id-column strain | zstd -c > data/gisaid.ndjson.zst
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniforge/base/envs/nextstrain/lib/python3.10/site-packages/augur/__init__.py", line 66, in run
    return args.__command__.run(args)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/nextstrain/lib/python3.10/site-packages/augur/curate/__init__.py", line 188, in run
    dump_ndjson(modified_records)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/nextstrain/lib/python3.10/site-packages/augur/io/json.py", line 63, in dump_ndjson
    for item in iterable:
  File "/opt/homebrew/Caskroom/miniforge/base/envs/nextstrain/lib/python3.10/site-packages/augur/curate/passthru.py", line 14, in run
    yield from records
  File "/opt/homebrew/Caskroom/miniforge/base/envs/nextstrain/lib/python3.10/site-packages/augur/io/metadata.py", line 278, in read_metadata_with_sequences
    sequences = pyfastx.Fasta(fasta)
RuntimeError: data/gisaid.fasta.zst is not plain or gzip compressed fasta formatted file


An error occurred (see above) that has not been properly handled by Augur.
To report this, please open a new issue including the original command and the error above:
    <https://github.com/nextstrain/augur/issues/new/choose>
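As a rough illustration of what better handling of this error could look like, here is a minimal sketch that sniffs the leading bytes of the input file and raises a readable message before the file reaches pyfastx. The function names `detect_compression` and `check_fasta_compression` are hypothetical and not part of Augur; the magic bytes are the standard ones for zstd, gzip and xz.

```python
# Leading "magic" bytes of common compression formats.
MAGIC_BYTES = {
    b"\x28\xb5\x2f\xfd": "zstd",
    b"\x1f\x8b": "gzip",
    b"\xfd7zXZ\x00": "xz",
}


def detect_compression(path):
    """Return the format suggested by the file's first bytes, or None if unrecognized."""
    with open(path, "rb") as handle:
        head = handle.read(6)
    for magic, name in MAGIC_BYTES.items():
        if head.startswith(magic):
            return name
    return None


def check_fasta_compression(path):
    """Hypothetical pre-flight check (not Augur's API): fail early with a clear message."""
    compression = detect_compression(path)
    # Per the RuntimeError above, only plain and gzip input is accepted for FASTA.
    if compression not in (None, "gzip"):
        raise ValueError(
            f"{path} appears to be {compression}-compressed; "
            "only plain or gzip-compressed FASTA is supported."
        )
```

Calling `check_fasta_compression("data/gisaid.fasta.zst")` before opening the file would turn the traceback into a single actionable error line.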

corneliusroemer avatar May 16 '23 13:05 corneliusroemer

The command should be able to accept zstd-compressed metadata, since we've added the extra zstd dependency for xopen.

The NDJSONs are expected to be streamed to the command, so this should work:

zstdcat file.ndjson.zst | augur curate ...
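Streaming works here because NDJSON carries one JSON record per line, so a consumer never needs the whole file or its compression format, only a text stream. A minimal sketch of such a reader follows; `read_ndjson` is a hypothetical helper for illustration, not Augur's own implementation.

```python
import io
import json


def read_ndjson(stream):
    """Yield one parsed record per non-empty line of a text stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as error:
            raise ValueError(f"Invalid JSON on line {line_number}: {error}") from error


# Example with an in-memory stream; in a pipeline this would be sys.stdin.
records = list(read_ndjson(io.StringIO('{"strain": "A"}\n\n{"strain": "B"}\n')))
```

Because the reader only sees decompressed text, any decompressor (`zstdcat`, `xzcat`, `zcat`) can sit upstream of the pipe.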

The FASTA file is currently limited by the pyfastx library that we use for random access of sequences. I created an issue to ask for xz support where you also asked for zst support 😄
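Until the library supports it, one possible workaround is to decompress the FASTA to a temporary plain file and hand that path to pyfastx. The sketch below is an assumption, not something Augur does today: `decompress_to_temp` is a hypothetical helper, and `opener` is any callable that opens a compressed file for binary reading (for example `gzip.open`, or `xopen` with its zstd extra installed).

```python
import shutil
import tempfile


def decompress_to_temp(path, opener, suffix=".fasta"):
    """Decompress `path` into a temporary file and return the temporary file's path.

    `opener(path, "rb")` must return a binary file object yielding decompressed bytes.
    The caller is responsible for deleting the temporary file afterwards.
    """
    with opener(path, "rb") as source, tempfile.NamedTemporaryFile(
        "wb", suffix=suffix, delete=False
    ) as target:
        # Stream in chunks so large FASTA files are never held in memory at once.
        shutil.copyfileobj(source, target)
        return target.name
```

The returned path could then be passed to `pyfastx.Fasta(...)` as in the traceback above. The trade-off is that the uncompressed copy temporarily takes up disk space, so this only helps with long-term storage, not peak usage.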

The author's response is not the most enthusiastic, so we may have to extend the library ourselves. I have no experience with C, so I will probably need extra time/help here. It is also not clear how much of a priority to make this, as we are still in the process of building out the subcommands.

joverlee521 avatar May 17 '23 17:05 joverlee521