research-template
Use arrow format
Feather files are smaller than the equivalent CSVs, so they are more efficient to store and process. They also store dtypes, which helps avoid some problems when loading the data for further processing.
We initially moved to `.csv.gz`, which was an improvement on uncompressed CSVs. However, it uses a significant amount of CPU. We believe that moving to Arrow/Feather would use much less CPU and be an overall improvement.
To do:
- [ ] Update project.yaml and code sample
- [ ] Update gitignore to ignore `.feather`/`.arrow` files
- [ ] Update docs, including Getting Started Guide and ehrql tutorials (https://github.com/opensafely/documentation/issues/1610)
- [ ] Ensure that arrow files can be viewed using Codespaces
- [ ] Provide researchers with instructions on how to view arrow files during local development in VS Code, RStudio and the Stata IDE (https://github.com/opensafely-core/opensafely-cli/issues/267) (Added by Lucy)