[PERF] Add a parallel local CSV reader
Adds a parallel reader for local CSV files to speed up CSV ingestion. The approach adapts some ideas from [1], but most of the performance gain comes from using buffer pools to minimize memory allocations.
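To illustrate the buffer-pool idea in isolation, here is a minimal Python sketch. It is not the code in this PR: `BufferPool`, `count_newlines`, and the buffer sizes are hypothetical names and values chosen for illustration. The pool preallocates a fixed set of buffers and reuses them for every chunk, so the read loop itself allocates nothing per chunk.

```python
import io
import queue


class BufferPool:
    """Fixed-size pool of reusable bytearrays (hypothetical illustration)."""

    def __init__(self, num_buffers: int, buffer_size: int) -> None:
        self._free = queue.Queue(maxsize=num_buffers)
        # Allocate every buffer once, up front.
        for _ in range(num_buffers):
            self._free.put(bytearray(buffer_size))

    def acquire(self, timeout=None) -> bytearray:
        # Blocks until a buffer is free; raises queue.Empty if the timeout expires.
        return self._free.get(timeout=timeout)

    def release(self, buf: bytearray) -> None:
        # Hand the buffer back so the next read can reuse it.
        self._free.put(buf)

    def available(self) -> int:
        return self._free.qsize()


def count_newlines(stream, pool: BufferPool) -> int:
    """Count newline bytes in a binary stream, reading into pooled buffers."""
    total = 0
    while True:
        buf = pool.acquire()
        try:
            n = stream.readinto(buf)  # fills the existing buffer in place
            if not n:
                break
            total += buf.count(b"\n", 0, n)
        finally:
            pool.release(buf)
    return total


pool = BufferPool(num_buffers=2, buffer_size=4)
data = io.BytesIO(b"a,b\nc,d\ne,f\n")
print(count_newlines(data, pool))  # 3
```

Note that `acquire` blocks when every buffer is checked out, so callers that hold one buffer while waiting for another can stall; the optional `timeout` is one way to surface that instead of hanging.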
Some performance numbers
We consider a simple case of reading and performing .collect() on a CSV file with 10^8 rows of 9 fields: 3 string fields, 5 int64 fields, and 1 double field. This file is roughly 5GB in size.
Non-native executor: 38.71140212500177s
Non-native executor, new CSV reader: 7.432862582994858s
Native executor: 44.55550079200475s
Native executor, new CSV reader: 4.117344291880727s
This represents a roughly 10.8x speedup on CSV reads for the native executor (44.56s to 4.12s) and a roughly 5.2x speedup for the non-native executor (38.71s to 7.43s).
Followup work
- The schema is currently taken from the convert options, but we should also perform schema inference.
- We need to add better estimators for record size, either via sampling, or by keeping track of stats as we go.
- Currently, for each read, the reader creates a buffer pool for reading CSV records plus a pool of slabs for reading the CSV file. We might need to change these to per-process pools to avoid high memory pressure on concurrent reads. However, care must be taken, otherwise it's possible for us to deadlock.
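As a rough sketch of the "keep track of stats as we go" option for record-size estimation, the Python snippet below maintains a running mean of bytes per record. It is only an illustration: `RecordSizeEstimator` and its default guess of 64 bytes are hypothetical, and it assumes records are newline-terminated with no newlines inside quoted fields.

```python
class RecordSizeEstimator:
    """Running estimate of mean CSV record size (hypothetical illustration)."""

    def __init__(self, initial_guess: float = 64.0) -> None:
        self._total_bytes = 0
        self._total_records = 0
        # Used until at least one record has been observed.
        self._initial_guess = initial_guess

    def observe(self, chunk: bytes) -> None:
        # Assumes one newline per record (no quoted newlines).
        self._total_bytes += len(chunk)
        self._total_records += chunk.count(b"\n")

    def mean_record_size(self) -> float:
        if self._total_records == 0:
            return self._initial_guess
        return self._total_bytes / self._total_records

    def records_in(self, num_bytes: int) -> int:
        # Estimated number of records in a region of num_bytes, at least 1.
        return max(1, round(num_bytes / self.mean_record_size()))


est = RecordSizeEstimator()
est.observe(b"a,b\nc,d\n")       # 8 bytes, 2 records
print(est.mean_record_size())    # 4.0
print(est.records_in(40))        # 10
```

An estimate like this could be used to size record buffers before parsing a chunk; a sampling-based estimator would instead call `observe` on a few chunks before the read starts.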
[1]: Ge, Chang et al. “Speculative Distributed CSV Data Parsing for Big Data Analytics.” Proceedings of the 2019 International Conference on Management of Data (2019).
CodSpeed Performance Report
Merging #2772 will degrade performances by 33.78%
Comparing desmondcheongzx:local-csv-reader-experiment (7b40f23) with main (cad9168)
Summary
⚡ 2 improvements
❌ 1 regressions
✅ 13 untouched benchmarks
:warning: Please fix the performance issues or acknowledge them on CodSpeed.
Benchmarks breakdown
| | Benchmark | main | desmondcheongzx:local-csv-reader-experiment | Change |
|---|---|---|---|---|
| ❌ | test_count[1 Small File] | 16.5 ms | 24.9 ms | -33.78% |
| ⚡ | test_explain[100 Small Files] | 52.8 ms | 39.9 ms | +32.33% |
| ⚡ | test_show[100 Small Files] | 597.4 ms | 355.8 ms | +67.91% |
This branch is a little borked. Keeping a record of it but reopening the PR at https://github.com/Eventual-Inc/Daft/pull/3055