biobear How to run polars dataframe methods on large FASTQ files in a memory efficient way

Open abearab opened this issue 1 year ago • 5 comments

As I mentioned in this comment, I'm interested in applying the "groupby" function to a polars dataframe. My goal is to count all unique sequences in a FASTQ file. I have a few questions:

How can I take advantage of biobear to read large FASTQ files without loading the whole data into memory?
How can I use multiprocessing to speed up the computation (e.g. https://docs.pola.rs/user-guide/misc/multiprocessing/)?
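
To make the first question concrete, here is a minimal sketch of the kind of streaming count I have in mind. It deliberately uses only the Python standard library (collections.Counter in place of a polars "groupby", and no biobear calls), so it says nothing about how biobear itself should be used. It assumes plain 4-line FASTQ records with no wrapped sequence lines, and the function names are hypothetical.

```python
import gzip
from collections import Counter
from itertools import islice
from typing import Iterator, TextIO


def iter_sequences(handle: TextIO) -> Iterator[str]:
    """Yield the sequence line of each FASTQ record, one at a time.

    Assumes strict 4-line records: header, sequence, '+', qualities.
    """
    # The sequence is the 2nd line of every 4-line record (index 1, step 4).
    for line in islice(handle, 1, None, 4):
        yield line.strip()


def count_unique_sequences(path: str) -> Counter:
    """Count occurrences of each distinct sequence without loading the file.

    Lines are read lazily, so memory use grows with the number of
    unique sequences rather than with the size of the file.
    """
    # Transparently handle gzip-compressed input based on the extension.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return Counter(iter_sequences(handle))
```

For example, count_unique_sequences("reads.fastq.gz").most_common(10) would list the ten most frequent sequences (the file name is a placeholder). What I'd like is the equivalent of this with biobear and polars, ideally spread over several processes.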

Thanks for your time.

Related: https://github.com/wheretrue/biobear/issues/89 and https://github.com/ArcInstitute/ScreenPro2/issues/28

Feb 27 '24 02:02 abearab
