spacemake
Bam tag histogram mrfifo
Replacing dropseq-tools BamTagHistogram
This tool has become a pain point, as we now routinely have hundreds of millions of legitimate spatial barcodes in Open-ST data.
This PR features a complete, drop-in replacement rewritten in Python using mrfifo. It is about 10x faster and uses less RAM. I've included unit test code and also run it on some real-world data, observing identical output compared to dropseq-tools.
Unless you run into issues, I'd like to merge this into fast-cmdline (and possibly master) asap.
Best, -Marvin
Thanks so much for the amazing code, Marvin! I started testing on the tiny spatial data and Open-ST mouse hippocampus, and ran into some minor issues:
- Several dependencies are missing from the environment.yaml; I will push some commits fixing this.
- A file `BamTagHistogram.log` is created at the root of the `spacemake` project (because of the `make_minimal_parser`). Should this be created at the specific project/sample folders?
- I've been running into issues with the `rerun_triggers` flag -- not all snakemake versions seem to support it (e.g., the newest we support in our environment.yaml).
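One way to cope with the `rerun_triggers` incompatibility would be to pass the option only when the installed snakemake is new enough. The sketch below is illustrative only: `parse_version`, `rerun_trigger_args`, and the `RERUN_TRIGGERS_MIN_VERSION` threshold are hypothetical names and values, the threshold is an assumption that should be checked against the snakemake changelog, and the thread does not say whether spacemake sets the option on the command line or through the Python API.

```python
from itertools import takewhile

# ASSUMPTION: first snakemake release that accepts --rerun-triggers.
# Verify against the snakemake changelog before relying on this value.
RERUN_TRIGGERS_MIN_VERSION = (7, 8)


def parse_version(text):
    """Turn a version string such as '7.32.4' into a tuple of ints (7, 32, 4).

    Only the leading digits of each dot-separated piece are used, so
    '7.8.0rc1' becomes (7, 8, 0).
    """
    parts = []
    for piece in text.strip().split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def rerun_trigger_args(snakemake_version, triggers=("mtime",)):
    """Return extra command-line arguments, or nothing for older versions."""
    if parse_version(snakemake_version) >= RERUN_TRIGGERS_MIN_VERSION:
        return ["--rerun-triggers", *triggers]
    return []


print(rerun_trigger_args("6.15.5"))
print(rerun_trigger_args("7.32.4"))
```

With the assumed threshold, the first call yields no extra arguments and the second yields the flag followed by `mtime`.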
As soon as the tests finish on the Open-ST data, I will report back and we can merge this into fast-cmdline and master. Also, I will run on very large Open-ST data (>10B reads) to explore the limits of the pipeline in terms of mem usage.
It worked with the tiny data, but then the `out_readcounts_prealigned.txt.gz` file and others are empty with larger (real) datasets. Will investigate...
Edit: the default `--min-count` threshold in BamTagHistogram was just too high for the data I was using (a subset of a large sample, so no 0.6 micron spots had > 10 counts)
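To illustrate why a count threshold can leave the output empty on sparse data, here is a minimal sketch of a tag histogram with a minimum-count filter. It is not the mrfifo implementation from this PR: it reads plain SAM text lines rather than BAM, `tag_histogram` and `make_record` are hypothetical helpers, and both the default of 10 and the inclusive comparison (`>=`) are assumptions.

```python
from collections import Counter


def tag_histogram(sam_lines, tag="CB", min_count=10):
    """Count occurrences of each value of a string tag in SAM text lines.

    Returns (value, count) pairs, most frequent first, keeping only values
    seen at least `min_count` times (ASSUMPTION: inclusive threshold).
    """
    counts = Counter()
    prefix = tag + ":Z:"
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        # optional fields start at column 12 of a SAM record
        for field in line.rstrip("\n").split("\t")[11:]:
            if field.startswith(prefix):
                counts[field[len(prefix):]] += 1
                break
    # sort by descending count, ties broken by value for deterministic output
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(value, n) for value, n in ordered if n >= min_count]


def make_record(name, barcode):
    """Build a hypothetical unaligned SAM record carrying a CB:Z: tag."""
    return "\t".join(
        [name, "4", "*", "0", "0", "*", "*", "0", "0", "ACGT", "IIII",
         f"CB:Z:{barcode}"]
    )


barcodes = ["AAA", "AAA", "AAA", "CCC", "CCC", "GGG"]
reads = [make_record(f"r{i}", bc) for i, bc in enumerate(barcodes)]

print(tag_histogram(reads, min_count=10))  # []
print(tag_histogram(reads, min_count=1))   # [('AAA', 3), ('CCC', 2), ('GGG', 1)]
```

On a small subset where no barcode reaches the threshold, every entry is filtered out, which matches the empty files described above; lowering the threshold to 1 keeps all observed barcodes.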
Tested, works as advertised :)
Maybe @nukappa can run another sample, but LGTM as soon as we address the minor points above (making sure installation works fine, and the `.log` file). Also, we should consider putting `--min-count 1` in both calls to `BamTagHistogram` at `main.smk`, otherwise it might not work well with Open-ST data
Running tests and will report soon
- [ ] Tests passed!
I added several fixes, and some optimizations to avoid using so much memory during `n_intersect_sequences.py` and `create_spatial_barcode_file` in `main.smk` (processing reads by chunks of 100M, instead of loading all).
@nukappa This is ready to test -- if it works, we're basically ready to merge into `fast-cmdline`, and this one into `master`.
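For readers unfamiliar with the chunking idea, a rough sketch of counting shared barcodes without loading the whole stream at once could look like this. It is not the code in `n_intersect_sequences.py`; `iter_chunks` and `count_shared` are hypothetical names, and the 100M default only mirrors the chunk size mentioned above.

```python
from itertools import islice


def iter_chunks(iterable, chunk_size):
    """Yield successive lists of up to `chunk_size` items from an iterable."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def count_shared(barcodes, reference, chunk_size=100_000_000):
    """Count distinct streamed barcodes that also occur in `reference`.

    Only one chunk of the stream is held in memory at a time; the set of
    matches can never grow beyond the size of `reference`.
    """
    seen = set()
    for chunk in iter_chunks(barcodes, chunk_size):
        seen.update(reference.intersection(chunk))
    return len(seen)


reference = {"AAA", "CCC", "TTT"}
stream = iter(["AAA", "GGG", "CCC", "AAA", "GGG"])
print(count_shared(stream, reference, chunk_size=2))  # 2
```

Peak memory is then governed by the chunk size plus the reference set, rather than by the total number of reads.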
Thanks so much for the amazing code, Marvin! I started testing on the tiny spatial data and Open-ST mouse hippocampus, and ran into some minor issues:
* Several dependencies missing from the environment.yaml, I will push some commits fixing this
@marvin-jens will address in mrfifo
* A file `BamTagHistogram.log` is created at the root of the `spacemake` project (because of the `make_minimal_parser`). Should this be created at the specific project/sample folders?
This was fixed
* I've been running into issues with the `rerun_triggers` flag -- not all snakemake versions seem to support it (e.g., the newest we support in our environment.yaml).
This is not necessary yet
As soon as the tests finish on the Open-ST data, I will report back and we can merge this into fast-cmdline and master. Also, I will run on very large Open-ST data (>10B reads) to explore the limits of the pipeline in terms of mem usage.
Ready to test by @nukappa and merge into fast-cmdline