Bam tag histogram mrfifo

Open marvin-jens opened this issue 11 months ago • 6 comments

Replacing dropseq-tools BamTagHistogram

This tool has become a pain point, as we now routinely have hundreds of millions of legitimate spatial barcodes in open-st data. This PR features a complete, drop-in replacement rewritten in Python using mrfifo. It is about 10x faster and uses less RAM. I've included unit test code and also run it on some real-world data, observing output identical to that of dropseq-tools.
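For readers unfamiliar with what a tag histogram computes, here is a minimal sketch in plain Python. It is not the mrfifo-based implementation from this PR: it assumes uncompressed SAM text lines as input rather than BAM, and the tag name `CB` and the helper `make_record` are illustrative assumptions.

```python
from collections import Counter


def make_record(name, barcode):
    """Build a minimal unaligned SAM record carrying a CB:Z:<barcode> tag (illustrative helper)."""
    mandatory = [name, "4", "*", "0", "0", "*", "*", "0", "0", "ACGT", "FFFF"]
    return "\t".join(mandatory + [f"CB:Z:{barcode}"])


def tag_histogram(sam_lines, tag="CB"):
    """Count how often each value of `tag` occurs, most frequent first."""
    prefix = tag + ":"
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        # SAM has 11 mandatory columns; optional TAG:TYPE:VALUE fields follow.
        for field in fields[11:]:
            if field.startswith(prefix):
                counts[field.split(":", 2)[2]] += 1
                break
    return counts.most_common()


records = [
    "@HD\tVN:1.6",
    make_record("read1", "AAAA"),
    make_record("read2", "CCCC"),
    make_record("read3", "AAAA"),
]
histogram = tag_histogram(records)
```

With hundreds of millions of distinct barcodes the dictionary of counts itself becomes the dominant memory cost, which is why the speed and RAM usage of this step matter.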

Unless you run into issues, I'd like to merge this into fast-cmdline (and possibly master) asap.

Best, -Marvin

marvin-jens avatar Mar 11 '24 12:03 marvin-jens

Thanks so much for the amazing code, Marvin! I started testing on the tiny spatial data and Open-ST mouse hippocampus, and ran into some minor issues:

  • Several dependencies missing from the environment.yaml, I will push some commits fixing this
  • A file BamTagHistogram.log is created at the root of the spacemake project (because of the make_minimal_parser). Should this be created at the specific project/sample folders?
  • I've been running into issues with the rerun_triggers flag -- not all snakemake versions seem to support it (e.g., the newest we support in our environment.yaml).

As soon as the tests finish on the Open-ST data, I will report back and we can merge this into fast-cmdline and master. Also, I will run on very large Open-ST data (>10B reads) to explore the limits of the pipeline in terms of mem usage.

danilexn avatar Apr 22 '24 13:04 danilexn

It worked with the tiny data, but then the out_readcounts_prealigned.txt.gz file and others are empty with larger (real) datasets. Will investigate...

Edit: the default --min-count argument in BamTagHistogram was just too high for the data I was using (a subset of a large sample, so no 0.6 micron spots had > 10 counts), which is why the output files came out empty.
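The effect of such a threshold can be illustrated with a small sketch. The function name, the inclusive comparison, and the threshold values below are assumptions made for illustration; the actual semantics of --min-count in BamTagHistogram may differ.

```python
def filter_by_min_count(histogram, min_count):
    """Keep (value, count) pairs whose count is at least `min_count` (inclusive by assumption)."""
    return [(value, n) for value, n in histogram if n >= min_count]


# A sparse, subsampled dataset: every barcode has only a handful of reads.
sparse_histogram = [("AAAA", 3), ("CCCC", 2), ("GGGG", 1)]

strict = filter_by_min_count(sparse_histogram, min_count=10)   # nothing survives
lenient = filter_by_min_count(sparse_histogram, min_count=1)   # everything survives
```

On sparse data a high threshold silently yields an empty result rather than an error, which matches the empty output files described above.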

danilexn avatar Apr 22 '24 13:04 danilexn

Tested, works as advertised :)

Maybe @nukappa can run another sample, but LGTM as soon as we address the minor points above (making sure installation works fine, and the .log file). Also, we should consider putting --min-count 1 in both calls to BamTagHistogram in main.smk, otherwise it might not work well with Open-ST data.

danilexn avatar Apr 22 '24 14:04 danilexn

Running tests and will report soon

  • [ ] Tests passed!

nukappa avatar Apr 23 '24 08:04 nukappa

I added several fixes and some optimizations that reduce memory usage in n_intersect_sequences.py and create_spatial_barcode_file in main.smk (reads are now processed in chunks of 100M instead of being loaded all at once).
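The chunking idea can be sketched roughly as follows. This is not the code from n_intersect_sequences.py; the function names are hypothetical, and the sketch only shows how a large stream can be intersected with a reference set while holding one chunk in memory at a time.

```python
from itertools import islice


def iter_chunks(items, chunk_size):
    """Yield successive lists of at most `chunk_size` items from any iterable."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def count_shared(reference, stream, chunk_size=100_000_000):
    """Count distinct sequences in `stream` that also occur in `reference`.

    Only one chunk of the stream is materialized at a time, so peak memory
    is bounded by the reference set plus a single chunk.
    """
    reference = set(reference)
    shared = set()
    for chunk in iter_chunks(stream, chunk_size):
        shared |= reference.intersection(chunk)
    return len(shared)


# Tiny chunk size so the chunking is visible; `stream` is a one-shot iterator.
n_shared = count_shared(["A", "B", "C"], iter(["B", "X", "C", "B", "Y"]), chunk_size=2)
```

The chunk size trades memory for per-chunk overhead: larger chunks mean fewer set operations but a higher peak footprint.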

@nukappa This is ready to test -- if it works, we're basically ready to merge into fast-cmdline, and this one into master.

danilexn avatar Apr 23 '24 15:04 danilexn

> Thanks so much for the amazing code, Marvin! I started testing on the tiny spatial data and Open-ST mouse hippocampus, and ran into some minor issues:
>
> * Several dependencies missing from the environment.yaml, I will push some commits fixing this

@marvin-jens will address in mrfifo

> * A file `BamTagHistogram.log` is created at the root of the `spacemake` project (because of the `make_minimal_parser`). Should this be created at the specific project/sample folders?

This was fixed

> * I've been running into issues with the `rerun_triggers` flag -- not all snakemake versions seem to support it (e.g., the newest we support in our environment.yaml).

This is not necessary yet

> As soon as the tests finish on the Open-ST data, I will report back and we can merge this into fast-cmdline and master. Also, I will run on very large Open-ST data (>10B reads) to explore the limits of the pipeline in terms of mem usage.

Ready to test by @nukappa and merge into fast-cmdline

danilexn avatar Apr 24 '24 10:04 danilexn