
WIP: intermediary file format specification

Open gessulat opened this issue 1 year ago • 0 comments

@sambenfredj's pull requests introduce streaming at several points in the workflow, but the intermediary file formats they rely on are not yet specified or documented. In addition, switching to a binary format such as partitioned pyarrow datasets would likely speed up IO.

Schemas will be defined here after @wfondrie's switch to polars.

Tasks

  • [ ] document where we need intermediary files
  • [ ] document how the files relate to input files, to each other, and to output files (e.g. how should they be joined?)
  • [ ] specify columns and their datatypes, and potential indices on columns

gessulat, Sep 21 '23 09:09