Investigate granges Rust crate
Investigate the granges Rust crate as a way to provide GenomicRanges-like functionality: https://github.com/vsbuffalo/granges
Thanks! I'm aware of it... I recently added a page to the docs showing how to work with the genomicranges package, but it's unlikely to be as fast as granges.
If you have a sec, would you be able to briefly sketch out the API you'd imagine? Are these like methods on the result object, something else?
Also, FWIW, you can recreate a lot of granges-like functionality with SQL if you're comfortable with it.
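To make the SQL suggestion concrete, here is a minimal sketch of an interval-overlap join. It assumes half-open BED-style coordinates and uses Python's built-in `sqlite3` only as a stand-in engine. The table names (`peaks`, `signal`) and column names are hypothetical and are not taken from this project's API.

```python
import sqlite3

# In-memory database used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE peaks  (chrom TEXT, start_pos INTEGER, end_pos INTEGER, name TEXT);
    CREATE TABLE signal (chrom TEXT, start_pos INTEGER, end_pos INTEGER, score REAL);
    """
)
conn.executemany(
    "INSERT INTO peaks VALUES (?, ?, ?, ?)",
    [("chr1", 100, 200, "peak1"), ("chr1", 300, 400, "peak2"), ("chr2", 50, 150, "peak3")],
)
conn.executemany(
    "INSERT INTO signal VALUES (?, ?, ?, ?)",
    [("chr1", 150, 250, 1.0), ("chr1", 390, 500, 2.0),
     ("chr1", 200, 300, 3.0), ("chr2", 150, 200, 4.0)],
)

# Two half-open intervals [s1, e1) and [s2, e2) overlap
# when s1 < e2 and s2 < e1, on the same chromosome.
rows = conn.execute(
    """
    SELECT p.name, s.score
    FROM peaks AS p
    JOIN signal AS s
      ON p.chrom = s.chrom
     AND p.start_pos < s.end_pos
     AND s.start_pos < p.end_pos
    ORDER BY p.name, s.score
    """
).fetchall()
```

Without an interval-aware index, an engine may evaluate a join like this by comparing every pair of rows within a chromosome, so it is a sketch of the semantics rather than of a fast implementation.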
Yes, I saw the genomicranges integration, but it is very slow if you have a lot of intersections. For example, intersecting a BED file with a bigWig took more than 1 hour yesterday, and 15 minutes with the latest version of iranges: https://github.com/BiocPy/GenomicRanges/issues/98
With my own implementation based on Polars and ncls (the intersection library behind pyranges), it takes less than 17 seconds.
> Also, FWIW, you can recreate a lot of granges-like functionality with SQL if you're comfortable with it.
When you have a lot of intersections, this will likely be slow unless you use specialized data structures that handle intervals efficiently (for example, the nested containment lists that ncls implements).
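As a rough illustration of why an interval-aware structure helps, the sketch below counts overlaps with two sorted arrays per chromosome and binary search, instead of comparing every pair. This is a simple sorted-array trick, not the nested containment list used by ncls or whatever granges does internally. The function names are hypothetical, and half-open, non-empty intervals are assumed.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_index(intervals):
    """Group (chrom, start, end) tuples by chromosome and sort starts and ends separately."""
    starts = defaultdict(list)
    ends = defaultdict(list)
    for chrom, start, end in intervals:
        starts[chrom].append(start)
        ends[chrom].append(end)
    return {chrom: (sorted(starts[chrom]), sorted(ends[chrom])) for chrom in starts}


def count_overlaps(index, chrom, qstart, qend):
    """Count indexed intervals overlapping the half-open query [qstart, qend)."""
    if chrom not in index:
        return 0
    starts, ends = index[chrom]
    # Intervals that begin before the query ends ...
    begin_before_query_end = bisect_left(starts, qend)
    # ... minus those that already finished at or before the query start.
    finished_before_query = bisect_right(ends, qstart)
    return begin_before_query_end - finished_before_query


subject = [("chr1", 150, 250), ("chr1", 390, 500), ("chr1", 200, 300), ("chr2", 150, 200)]
index = build_index(subject)
counts = [
    count_overlaps(index, "chr1", 100, 200),
    count_overlaps(index, "chr1", 300, 400),
    count_overlaps(index, "chr2", 50, 150),
]
```

Each query costs two binary searches after a one-time sort, so many queries against a large set avoid the quadratic pairwise comparison. Returning the overlapping intervals themselves, rather than a count, needs a richer structure such as an interval tree or a nested containment list.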
> If you have a sec, would you be able to briefly sketch out the API you'd imagine? Are these like methods on the result object, something else?
I haven't looked closely at it yet, so no idea at the moment.
Another similar crate: https://github.com/noamteyssier/bedrs
Thanks yeah, I agree with all of that.
My initial thought is something similar to how biobear works with VCF/BAM indices, as I'd ideally want it to be compatible with SQL and then expose a more Pythonic API on top of it.
> but it's unlikely to be as fast as granges.
Definitely not going to reach Rust-like speeds in Python :)
Our focus initially has been to bring Bioconductor-like representations to Python. It's time I found some focused time to optimize the methods that were implemented.
I know this is long overdue, but we recently got around to optimizing several overlap and search queries. The test case posted in https://github.com/BiocPy/GenomicRanges/issues/98 now takes ~6 seconds! (ref: https://github.com/BiocPy/GenomicRanges/pull/152)
Or you can take a look at https://biodatageeks.org/polars-bio/, which is also a robust and scalable option!
@jkanche It might also be worth taking a look at https://github.com/pyranges/ruranges, which backs the second iteration of pyranges (https://github.com/pyranges/pyranges_1.x). With ruranges you don't need to split intervals by chromosome manually, as it is handled internally.
@mwiewior I would love it if you could rerun some of your benchmarks, since they use a super old version of genomicranges - https://github.com/biodatageeks/polars-bio/issues/156
@ghuls, I did not know about ruranges. Thank you for pointing me to it, it looks very cool.
Our implementation also handles the chromosomes internally and is pretty fast at handling millions of intervals. While my initial implementations were sloppy, we've made great progress on performance lately. Hope you'll give it a try!