CITE-seq-Count

Feature/cells argument

Open Hoohm opened this issue 2 years ago • 4 comments

  • [x] Rewrite data chunking
  • [x] Rewrite loading of CSVs with polars
  • [x] Rewrite Mapping in polars
  • [x] Rewrite barcode correction in polars
  • [x] Rewrite UMI correction in polars
  • [x] Rewrite fastq reading in polars
  • [x] Disambiguation of whitelist and reference
  • [x] Generate parquet outputs
  • [x] Deprecate csv outputs

Task details

Rewrite UMI correction in polars

The current version of CSC uses umi_tools.network.UMIClusterer() to go through the list of UMIs for each cell and feature and apply any UMI corrections needed. The simple implementation in polars is to use map_elements, but this is not optimized, since it runs a Python function on every element instead of using the polars engine. There is big potential for improvement if this step can be rewritten entirely in polars.

Status on branch

UMI correction is skipped at the moment; no function is available yet.
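As a starting point, here is a rough sketch of the per-group work that a map_elements call would hand to Python: given the UMIs of one cell and one feature, map each UMI to the one it gets collapsed into. This is a simplified directional rule (Hamming distance within a threshold, and the absorbing UMI at least 2n - 1 times as abundant), not the UMIClusterer implementation, and the names hamming and correct_umis are hypothetical.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def correct_umis(umis: list[str], max_distance: int = 1) -> dict[str, str]:
    """Map every observed UMI to the UMI it is collapsed into.

    Hypothetical helper: a simplified directional rule, not UMIClusterer.
    """
    counts = Counter(umis)
    # Visit UMIs from most to least abundant so absorbing UMIs are settled first.
    ordered = sorted(counts, key=lambda u: (-counts[u], u))
    parent: dict[str, str] = {}
    for umi in ordered:
        parent[umi] = umi
        for candidate in ordered:
            if candidate == umi:
                break  # only UMIs earlier in the ordering can absorb this one
            close = hamming(umi, candidate) <= max_distance
            dominant = counts[candidate] >= 2 * counts[umi] - 1
            if close and dominant:
                parent[umi] = parent[candidate]
                break
    return parent


# One cell/feature group: AAAT is one mismatch away from the abundant AAAA.
group = ["AAAA"] * 10 + ["AAAT"] * 2 + ["CCCC"] * 3
print(correct_umis(group))
# {'AAAA': 'AAAA', 'CCCC': 'CCCC', 'AAAT': 'AAAA'}
```

The pairwise comparison inside each group is the part that would need to be expressed with polars operations to get away from per-element Python calls.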

Rewrite fastq reading in polars

The current version of CSC reads in the fastq files and then writes out a big csv, which we read using polars. Fastq files are basically text files with 4 lines per read. We can rewrite the input step to read fastq files directly and store them in a dataframe. This reduces IO operations and should be faster as well. It would also allow extending CSC to filter reads on quality.

Status on branch

io.write_mapping_input is the function that reads the fastqs and writes the csv to be read later. Then preprocessing.split_data_input reads the csv file and generates the dataframes needed for processing. The idea is to skip the intermediate step by reading the fastqs directly into those dataframes.
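To illustrate the idea of skipping the csv, here is a minimal sketch that walks a fastq handle 4 lines at a time and collects the records into column lists. The function name read_fastq_columns and the column names are assumptions for illustration, not what is on the branch. It uses only the standard library; the resulting dict of lists is the kind of input pl.DataFrame accepts.

```python
import io
from itertools import islice


def read_fastq_columns(handle) -> dict[str, list[str]]:
    """Collect FASTQ records into column lists (hypothetical helper)."""
    columns: dict[str, list[str]] = {"name": [], "sequence": [], "quality": []}
    while True:
        # A FASTQ record is 4 lines: @name, sequence, + separator, quality.
        record = [line.rstrip("\r\n") for line in islice(handle, 4)]
        if not record:
            break  # end of input
        well_formed = (
            len(record) == 4
            and record[0].startswith("@")
            and record[2].startswith("+")
        )
        if not well_formed:
            raise ValueError(f"Malformed FASTQ record: {record!r}")
        columns["name"].append(record[0][1:])  # drop the leading '@'
        columns["sequence"].append(record[1])
        columns["quality"].append(record[3])
    return columns


# In-memory example; a real file would be opened with open() or gzip.open(path, "rt").
fastq_text = "@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\n!!!!\n"
columns = read_fastq_columns(io.StringIO(fastq_text))
print(columns["sequence"])
# ['ACGT', 'TTTT']
```

Keeping the quality string as its own column is what would make quality-based read filtering possible later on.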

Disambiguation of whitelist and reference

Currently CSC uses the terms whitelist and reference to distinguish a short list handpicked by the user from the full set of possible barcodes. But historically, reference files have also been called whitelists, which makes this confusing. I would like to change the terms to reference_subset and reference to make it clearer that the first is a subset of the second.

Status on branch

Remove every mention of whitelist and replace it with subset.

I'm going to deal with the last two tasks.

Hoohm, Jul 03 '22 13:07