biobear
How to run polars dataframe methods on large FASTQ files in a memory-efficient way
As I mentioned in this comment, I'm interested in applying a "groupby" operation to a polars dataframe. My goal is to count all unique sequences in a FASTQ file. I have a few questions:
- How can I take advantage of `biobear` to read large FASTQ files without loading the whole data into memory?
- How can I use multiprocessing to reduce the computation time? e.g. https://docs.pola.rs/user-guide/misc/multiprocessing/
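To make the goal concrete, the computation I'm after is roughly equivalent to the standard-library sketch below. It does not use biobear or polars; it only illustrates the streaming behaviour I would like to get from them. The function names (`open_fastq`, `iter_sequences`, `count_unique_sequences`) are hypothetical, and the sketch assumes strict 4-line FASTQ records (no wrapped sequence lines).

```python
import gzip
from collections import Counter
from itertools import islice
from typing import Iterator, TextIO


def open_fastq(path: str) -> TextIO:
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def iter_sequences(handle: TextIO) -> Iterator[str]:
    """Yield the sequence line of each record, one record at a time."""
    # Assumption: every record is exactly 4 lines (header, sequence, '+', quality),
    # so the sequence is the line at index 1 of each group of 4.
    for line in islice(handle, 1, None, 4):
        yield line.rstrip("\n")


def count_unique_sequences(path: str) -> Counter:
    """Count how often each distinct sequence occurs, without loading the file."""
    with open_fastq(path) as handle:
        return Counter(iter_sequences(handle))
```

Here memory use grows with the number of distinct sequences rather than with the file size, which is the property I'm hoping to keep when doing the same aggregation through a dataframe.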
Thanks for your time.
Related: https://github.com/wheretrue/biobear/issues/89 and https://github.com/ArcInstitute/ScreenPro2/issues/28