mokapot
WIP: intermediary file format specification
@sambenfredj's pull requests introduce streaming at several points in the workflow, but the intermediary file formats are not yet specified or documented. In addition, switching to a binary format such as partitioned pyarrow datasets would speed up I/O.
Schemas will be defined here after @wfondrie's switch to polars.
Tasks
- [ ] document where we need intermediary files
- [ ] document how the files relate to input files, to each other, and to output files (e.g. how should they be joined?)
- [ ] specify columns, their datatypes, and potential indices on columns