speed up / parallelize `compare_taxonomy` step

Open ctb opened this issue 5 years ago • 1 comment

Right now the biggest bottleneck in charcoal is `compare_taxonomy`, which does the summarization. It iterates over all of the input genomes, computes summary statistics for each one, and then outputs two big spreadsheets. So, while all of the steps before (and after) it can work on individual genomes, `compare_taxonomy` looks at all of them at once and becomes the bottleneck for large collections.

In early code, `just_taxonomy` dealt with this issue by outputting a single line per genome, and then a separate step combined those lines into one honkin' big CSV. We could do the same here, except...

...at least for now, the taxonomy summary might still require an all-by-all computation.

Also, the `just_taxonomy` approach ended up being harder to modify and refactor after the fact, so I don't necessarily want to jump to it.
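
For reference, the one-row-per-genome pattern described above could look roughly like the sketch below. This is not charcoal's actual code: the column names and the `summarize_genome`, `write_row`, and `combine` helpers are hypothetical, and it assumes each genome's row can be computed without looking at any other genome (which the all-by-all concern may rule out).

```python
import csv
import io

# Hypothetical summary columns; the real spreadsheets have their own fields.
FIELDS = ["genome", "n_hashes", "n_contaminated"]


def summarize_genome(name, hashes, contaminated):
    """Build one summary row for a single genome, independent of all others."""
    return {
        "genome": name,
        "n_hashes": len(hashes),
        "n_contaminated": len(contaminated),
    }


def write_row(row, fp):
    """Scatter step: write a header plus one row to a per-genome CSV."""
    writer = csv.DictWriter(fp, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerow(row)


def combine(per_genome_fps, out_fp):
    """Gather step: concatenate the per-genome CSVs into one big CSV."""
    writer = csv.DictWriter(out_fp, fieldnames=FIELDS)
    writer.writeheader()
    for fp in per_genome_fps:
        for row in csv.DictReader(fp):
            writer.writerow(row)
```

With this layout, the per-genome step can run as one job per genome, and only the cheap `combine` step touches everything.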

Sep 11 '20 13:09 ctb

Idle thinking: since (for at least some large data sets) many genomes have no observable contamination or can't be processed, perhaps we can develop a two-stage process that lets us iterate faster on just the genomes that need contamination analysis.
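
A minimal sketch of what that first stage might look like, assuming some cheap per-genome check exists. The `needs_analysis` predicate and the `processable` / `n_contaminated` keys are made up for illustration and are not part of charcoal.

```python
def partition_genomes(genomes, needs_analysis):
    """Stage 1: split genomes into those needing full analysis and the rest."""
    todo, skipped = [], []
    for genome in genomes:
        if needs_analysis(genome):
            todo.append(genome)
        else:
            skipped.append(genome)
    return todo, skipped


def needs_analysis(genome):
    """Hypothetical cheap check: processable and shows some contamination."""
    return genome["processable"] and genome["n_contaminated"] > 0


# Toy input records standing in for per-genome results of a fast first pass.
genomes = [
    {"name": "g1", "processable": True, "n_contaminated": 0},
    {"name": "g2", "processable": True, "n_contaminated": 12},
    {"name": "g3", "processable": False, "n_contaminated": 0},
]

# Stage 2 (the expensive contamination analysis) would run only on `todo`.
todo, skipped = partition_genomes(genomes, needs_analysis)
```

Only `todo` would go on to the expensive step, so reruns on large collections would touch a smaller set of genomes.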

Sep 13 '20 16:09 ctb