4Mb genome, many mutations: first line of .jsonl.gz becomes prohibitively long with amino acid changes
Hi Theo - my group has a tree of 127k M. tuberculosis genomes, 212k nodes. The M.tb genome is 4.4Mb and there are many mutations in the tree. With nucleotide mutations only, the first line of the .jsonl.gz when decompressed is ~263MB. At that size, the tree takes a few minutes to load on a MacBook Pro M2 Max with 64GB RAM. It takes ~10 minutes to load on a MacBook Pro M2 with 16GB RAM (long enough for a PI to get tired of waiting and go do something else 🙂).
However, when usher_to_taxonium is run with --genbank and amino acid changes are added, the decompressed first line grows to ~1.1GB and something in the Taxonium app's back end dies with this error:
```
sending message
stderr: file:///Applications/Taxonium.app/Contents/Resources/app/node_modules/taxonium_data_handling/importing.js:62
      cur_line += data.toString();
                  ^

RangeError: Invalid string length
    at Gunzip.<anonymous> (file:///Applications/Taxonium.app/Contents/Resources/app/node_modules/taxonium_data_handling/importing.js:62:22)
    at Gunzip.emit (node:events:513:28)
    at addChunk (node:internal/streams/readable:324:12)
    at readableAddChunk (node:internal/streams/readable:297:9)
    at Readable.push (node:internal/streams/readable:234:10)
    at Zlib.processCallback (node:zlib:566:10)

Node.js v18.12.1
```
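For what it's worth, I believe the `RangeError: Invalid string length` is V8's hard cap on string length (around 2^30 characters, i.e. roughly 1GB), which our ~1.1GB decompressed first line exceeds, so the `cur_line += data.toString()` accumulator can never hold it. As a hypothetical sketch (not Taxonium's actual importer), a splitter that buffers raw `Buffer` fragments and only materializes a string per completed line avoids the repeated string concatenation, though a single line over the V8 cap would still fail at the final `toString()`, so a line that big ultimately needs a streaming JSON parser:

```javascript
// Hypothetical sketch, not Taxonium's code: split a byte stream into lines
// by scanning each chunk for '\n' and buffering leftovers as an array of
// Buffers, instead of growing one string with `cur_line += data.toString()`.
// Note: a single line larger than V8's max string length would still throw
// at the final toString(), so this only helps when individual lines fit.
class LineAssembler {
  constructor(onLine) {
    this.onLine = onLine; // callback invoked once per complete line
    this.pending = [];    // Buffer fragments of the current partial line
  }
  push(chunk) {
    let start = 0;
    let nl;
    while ((nl = chunk.indexOf(0x0a, start)) !== -1) { // 0x0a = '\n'
      this.pending.push(chunk.subarray(start, nl));
      this.onLine(Buffer.concat(this.pending).toString('utf8'));
      this.pending = [];
      start = nl + 1;
    }
    if (start < chunk.length) this.pending.push(chunk.subarray(start));
  }
  flush() {
    // emit any trailing line that had no final newline
    if (this.pending.length) {
      this.onLine(Buffer.concat(this.pending).toString('utf8'));
      this.pending = [];
    }
  }
}
```

Hooking something like this to the gunzip stream's `data` events would keep memory bounded by the longest line rather than by everything read so far.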
Then the UI just freezes and never finishes loading.
So for now we'll do without the amino acid changes, and go do something else while the nuc-only version loads. But we were hoping you'd have some ideas about how to magically speed up the initial load when there are so many mutations. 🙂
I can share the tree files offline if you would like to test them out on your end.