How to run polars dataframe methods on large FASTQ files in a memory efficient way

Open abearab opened this issue 1 year ago • 5 comments

As I mentioned in this comment, I'm interested in applying a "groupby" operation to a polars dataframe. My goal is to count all unique sequences in a FASTQ file. I have a few questions:

  1. How can I take advantage of biobear to read large FASTQ files without loading the whole data into memory?
  2. How can I use multiprocessing to speed up the computation? e.g. https://docs.pola.rs/user-guide/misc/multiprocessing/
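
To make the goal concrete, here is a minimal sketch of the batch-wise counting I have in mind. It uses only the Python standard library, not the biobear or polars API, and it assumes plain 4-line FASTQ records (no wrapped sequence lines). The names `iter_sequences`, `count_unique_sequences`, and `batch_size` are hypothetical, chosen just for this illustration.

```python
import io
from collections import Counter
from itertools import islice
from typing import Iterator, TextIO


def iter_sequences(handle: TextIO) -> Iterator[str]:
    """Yield the sequence line of each 4-line FASTQ record, one at a time."""
    # Assumption: every record is exactly 4 lines (header, sequence, '+', quality),
    # so the sequence lines are lines 1, 5, 9, ... (0-based) of the file.
    for line in islice(handle, 1, None, 4):
        yield line.rstrip("\n")


def count_unique_sequences(handle: TextIO, batch_size: int = 100_000) -> Counter:
    """Count sequences while holding at most `batch_size` reads in memory at once."""
    total: Counter = Counter()
    sequences = iter_sequences(handle)
    while True:
        # Pull the next batch lazily; the file is never read in full.
        batch = list(islice(sequences, batch_size))
        if not batch:
            break
        # Add this batch's counts to the running total.
        total.update(batch)
    return total


# Tiny in-memory FASTQ standing in for a large file on disk.
demo = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nTTTT\n+\nIIII\n"
    "@r3\nACGT\n+\nIIII\n"
)
counts = count_unique_sequences(demo, batch_size=2)
print(counts.most_common())  # [('ACGT', 2), ('TTTT', 1)]
```

Memory use here is bounded by the batch size plus the number of distinct sequences, not by the file size. What I am looking for is the equivalent pattern with biobear doing the reading and polars doing the per-batch "groupby".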

Thanks for your time.


Related: https://github.com/wheretrue/biobear/issues/89 and https://github.com/ArcInstitute/ScreenPro2/issues/28

abearab avatar Feb 27 '24 02:02 abearab