serratus
serratus copied to clipboard
Update benchmarking script with new CLI
Opening this PR to track progress for #27. Keeping as draft until ready.
TODO:
- split into 2 scripts
-
cov_benchmark.py
- [x] design high-level CLI
- [x] implement simulation parameters
- [x] implement alignment parameters
- [x] implement custom subprocess parameters (
msbar
,art_illumina
,bowtie2
) - [x] convert counts to proportions
-
pangenome_cov_benchmark.py
- [x] per genome in pangenome: generate single-record FASTA, run
cov_benchmark.py
with predefined parameters - [x] aggregate results in
.csv
file
- [x] per genome in pangenome: generate single-record FASTA, run
-
- [x] update README
- [ ] write containerized unit tests (?)
PR closes #27
I was able to do most of the finalizing work for this today. A few more unit tests that I'm figuring out how to implement since it doesn't seem like msbar
supports a random seed to generate reproducible output.
The updated README is mostly technical without much insight on the biological context; @ababaian if you have any comments/suggestions feel free to modify directly or let me know!
Hey @victorlin, can you remove the fq files from the repo and instead host them on the s3 bucket. There is a test_data
folder which these would fit in.
--
I'm a bit tied up at the moment, I'll do a proper review tomorrow
@ababaian done - moved all the test input files to s3://serratus-public/test-data/benchmarker/
. Also added a setup.sh
script to copy the files into test/benchmarker/local/
. Using local
for the folder name since it's targeted by our repo's .gitignore.
A lot has happened since this was started. Is the benchmarker still needed?
The last task is unit testing msbar
, but it seems non-trivial. We could merge this PR in and add a separate task, or just close out to make room for other exciting things.
IMO we don't need a mapping benchmark for the main Serratus search, we have settled on how to run bowtie2 and are not likely to revisit benchmarking of mappers.
Ironically enough I do think we /should/ revisit this as there will be ways to improve on --very-sensitive-local
I just think on the totem pole of priorites this has gotten bumped down quite a bit as more pressing things always seem to be there. I'd say let's leave this open and swing back to it when time is a luxury we have.