[PERF] Add a parallel local CSV reader

Open desmondcheongzx opened this issue 1 year ago • 1 comments

Adds a parallel CSV reader to speed up ingestion of local CSV files. The approach adapts some ideas laid out in [1], but most of the performance gains came from using buffer pools to minimize memory allocations.
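
To illustrate the buffer pool idea, here is a minimal Python sketch. It is not the reader added in this PR: `BufferPool` and `count_newlines` are hypothetical names, and the sketch only shows how a fixed set of preallocated buffers can be recycled across reads instead of allocating a fresh buffer per chunk.

```python
import io
import queue


class BufferPool:
    """Fixed set of reusable bytearrays (hypothetical, for illustration only)."""

    def __init__(self, num_buffers: int, buffer_size: int) -> None:
        self._free: "queue.Queue[bytearray]" = queue.Queue()
        self.allocations = 0
        for _ in range(num_buffers):
            # All allocations happen up front; reads only recycle these.
            self._free.put(bytearray(buffer_size))
            self.allocations += 1

    def acquire(self) -> bytearray:
        # Blocks until a buffer is released, which also bounds memory use.
        return self._free.get()

    def release(self, buf: bytearray) -> None:
        self._free.put(buf)


def count_newlines(stream: io.BufferedIOBase, pool: BufferPool) -> int:
    """Scan a byte stream chunk by chunk, reusing pooled buffers."""
    total = 0
    while True:
        buf = pool.acquire()
        try:
            n = stream.readinto(buf)  # fills the existing buffer in place
            if not n:
                return total
            total += buf.count(b"\n", 0, n)
        finally:
            pool.release(buf)
```

For example, `count_newlines(io.BytesIO(data), BufferPool(num_buffers=2, buffer_size=64))` scans `data` with exactly two buffer allocations, however large `data` is.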

Some performance numbers

We consider a simple case of reading and performing .collect() on a CSV file with 10^8 rows of 9 fields: 3 string fields, 5 int64 fields, and 1 double field. This file is roughly 5GB in size.

Non-native executor:                 38.71140212500177s
Non-native executor, new CSV reader: 7.432862582994858s
Native executor:                     44.55550079200475s
Native executor, new CSV reader:     4.117344291880727s

This represents a roughly 10x speedup on CSV reads for the native executor.
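
As a quick check of the arithmetic, the speedup factors can be computed directly from the timings listed above. The dictionary keys are just labels chosen for this sketch.

```python
# Wall-clock timings in seconds, copied from the numbers reported above:
# (existing reader, new CSV reader)
timings = {
    "non-native": (38.71140212500177, 7.432862582994858),
    "native": (44.55550079200475, 4.117344291880727),
}

# Speedup = old time / new time.
speedups = {name: old / new for name, (old, new) in timings.items()}

for name, factor in speedups.items():
    print(f"{name} executor: {factor:.1f}x faster with the new CSV reader")
```

This gives roughly 5.2x for the non-native executor and roughly 10.8x for the native executor.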

Followup work

  • The schema is currently taken from the convert options, but we should also perform schema inference.
  • We need to add better estimators for record size, either via sampling, or by keeping track of stats as we go.
  • Currently, for each read, the reader creates a buffer pool for reading CSV records plus a pool of slabs for reading the CSV file. We might need to change these to per-process pools to avoid high memory pressure on concurrent reads. However, care must be taken, otherwise it's possible for us to deadlock.
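
The record size estimators mentioned in the second item could look something like the following Python sketch. It is only an illustration under stated assumptions: `estimate_record_size` and `RunningRecordSize` are hypothetical names, and counting newlines assumes no quoted field contains an embedded newline.

```python
import io


def estimate_record_size(stream: io.BufferedIOBase, sample_bytes: int = 64 * 1024) -> float:
    """Estimate mean bytes per record from a sample at the start of the file."""
    sample = stream.read(sample_bytes)
    newlines = sample.count(b"\n")
    if newlines == 0:
        # No complete record in the sample; fall back to the sample length.
        return float(len(sample))
    # Ignore the trailing partial record after the last newline.
    complete_bytes = sample.rfind(b"\n") + 1
    return complete_bytes / newlines


class RunningRecordSize:
    """Running mean of record size, updated as chunks are parsed."""

    def __init__(self, initial_estimate: float) -> None:
        self.initial_estimate = initial_estimate
        self.total_bytes = 0
        self.total_records = 0

    def update(self, num_bytes: int, num_records: int) -> None:
        self.total_bytes += num_bytes
        self.total_records += num_records

    @property
    def mean(self) -> float:
        # Until real stats arrive, use the sampled estimate.
        if self.total_records == 0:
            return self.initial_estimate
        return self.total_bytes / self.total_records
```

A reader could seed `RunningRecordSize` with the sampled estimate and then call `update` after each parsed chunk, so that later buffers are sized from observed data rather than from the initial guess.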

[1]: Ge, Chang et al. “Speculative Distributed CSV Data Parsing for Big Data Analytics.” Proceedings of the 2019 International Conference on Management of Data (2019).

desmondcheongzx, Aug 30 '24 21:08

CodSpeed Performance Report

Merging #2772 will degrade performances by 33.78%

Comparing desmondcheongzx:local-csv-reader-experiment (7b40f23) with main (cad9168)

Summary

⚡ 2 improvements ❌ 1 regressions ✅ 13 untouched benchmarks

:warning: Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark                      main      desmondcheongzx:local-csv-reader-experiment  Change
test_count[1 Small File]       16.5 ms   24.9 ms                                      -33.78%
test_explain[100 Small Files]  52.8 ms   39.9 ms                                      +32.33%
test_show[100 Small Files]     597.4 ms  355.8 ms                                     +67.91%

codspeed-hq[bot], Aug 30 '24 21:08

This branch is a little borked. Keeping a record of it but reopening the PR at https://github.com/Eventual-Inc/Daft/pull/3055

desmondcheongzx, Oct 16 '24 01:10