What should we benchmark?
Following the same process as sourmash-bio/sourmash#2410, we will benchmark the charcoal workflow with the demo directory and/or the six signatures included in that issue. Suggested runs:
- Run the demo repo
- Run each sequence alone
- Run a variety of sequence sets, from small to large
- Run all six together
It may be interesting to also compare the results of `sourmash search --containment` to `charcoal.contigs_list_contaminents.py` in this repo.
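As a rough sketch of what that comparison could use (this is not charcoal's own code, and the file names are placeholders), the containment value that `sourmash search --containment` reports for a pair of signatures can be reproduced with the sourmash Python API:

```python
# Rough sketch: compute containment between a query genome signature and a
# putative contaminant signature via the sourmash Python API.
# File names are placeholders; the signatures must share ksize and scaled
# for contained_by() to be meaningful.
import sourmash

query = next(iter(sourmash.load_file_as_signatures("query-genome.sig")))
match = next(iter(sourmash.load_file_as_signatures("contaminant.sig")))

# Containment of the query in the match, and the reverse.
print("query contained by match:", query.minhash.contained_by(match.minhash))
print("match contained by query:", match.minhash.contained_by(query.minhash))
```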
It would also be interesting to compare the accuracy of `sourmash gather` and genome-grist MinSetCov taxonomic outputs with and without charcoal.
It sounds like you might be trying to benchmark both computational performance and classification performance. Those are pretty different things.
I don't think that charcoal has any individually expensive steps or computationally complex scripts; it's just that the workflow overall involves an awful lot of steps, much like genome-grist. That may change your benchmarking strategy.
I agree! They are completely different benchmarks. Mostly wanted to jot down the notion before it left me forever. Additionally, since we will be writing an analytical benchmark for computational results, we will have a foundation to come back to in the future when we are ready for a biological benchmark.
Would you suggest forking and adding benchmark directives throughout the Snakefile instead of a global benchmark?
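For reference, a per-rule `benchmark:` directive would look roughly like this (the rule name, file paths, database, and shell command here are made up for illustration, not taken from charcoal's Snakefile):

```
rule contig_search:
    input:
        sig="outputs/{genome}.sig",
        db="databases/contaminants.sbt.zip",
    output:
        csv="outputs/{genome}.search.csv",
    # Snakemake writes wall-clock time and memory stats for each run of this rule here:
    benchmark:
        "benchmarks/{genome}.contig_search.tsv"
    shell:
        "sourmash search --containment {input.sig} {input.db} -o {output.csv}"
```

Each benchmarked rule then writes its own TSV of runtime and peak memory that can be aggregated across the workflow afterwards; the global alternative would be wrapping the whole snakemake invocation in GNU `time`, which gives one overall number but no per-rule breakdown.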