iceberg-rust Concurrent data file fetching and parallel RecordBatch processing

Open • sdd opened this issue 1 year ago • 1 comment

This brings some big performance gains over the previous sequential batch processing. On my 12-core Ryzen 9 5900X, I see all 12 cores at about 50% utilization.

In my perf testing branch for this, a full table scan retrieving all the data hit 84 million rows in 7s, or over 11M rows/sec. Real-world throughput could be quite a bit higher, as 50% of the CPU usage was MinIO serving up the data files.

As with the concurrent file plan PR, the concurrency config has been set to fast defaults based on testing a range of values but can be user-configured.
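To illustrate the general idea of a user-configurable concurrency limit, here is a minimal sketch using only the standard library. It is not the code from this PR: `DataFile`, `Batch`, `read_file`, `scan_concurrently`, and `default_concurrency_limit` are all hypothetical names, the "fetch" is simulated in memory, and OS threads are used only to keep the example dependency-free.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a data file: a path plus the rows it holds.
#[derive(Clone, Debug)]
pub struct DataFile {
    pub path: String,
    pub rows: Vec<i64>,
}

/// Hypothetical stand-in for an Arrow RecordBatch.
#[derive(Debug, PartialEq)]
pub struct Batch {
    pub source: String,
    pub values: Vec<i64>,
}

/// Assumed default: one worker per available core (falls back to 1).
pub fn default_concurrency_limit() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

/// Simulates fetching and decoding one file into fixed-size batches.
fn read_file(file: &DataFile, batch_size: usize) -> Vec<Batch> {
    file.rows
        .chunks(batch_size.max(1))
        .map(|chunk| Batch {
            source: file.path.clone(),
            values: chunk.to_vec(),
        })
        .collect()
}

/// Processes `files` with at most `concurrency_limit` workers at once.
/// Batches are returned in completion order, not file order.
pub fn scan_concurrently(
    files: Vec<DataFile>,
    concurrency_limit: usize,
    batch_size: usize,
) -> Vec<Batch> {
    // Shared work queue: each worker claims the next unprocessed file.
    let queue = Arc::new(Mutex::new(files.into_iter()));
    let (tx, rx) = mpsc::channel();

    let workers: Vec<_> = (0..concurrency_limit.max(1))
        .map(|_| {
            let queue = Arc::clone(&queue);
            let tx = tx.clone();
            thread::spawn(move || loop {
                // The lock is held only while claiming a file, not while reading it.
                let next = queue.lock().unwrap().next();
                match next {
                    Some(file) => {
                        for batch in read_file(&file, batch_size) {
                            if tx.send(batch).is_err() {
                                return; // receiver went away, stop early
                            }
                        }
                    }
                    None => return, // queue drained
                }
            })
        })
        .collect();

    // Drop the original sender so the receiver ends once all workers finish.
    drop(tx);
    let batches: Vec<Batch> = rx.into_iter().collect();
    for worker in workers {
        worker.join().unwrap();
    }
    batches
}
```

A caller would pass either `default_concurrency_limit()` or an explicit value, for example `scan_concurrently(files, 4, 1024)`. Because results arrive in completion order, any consumer that needs file order would have to sort or tag the batches itself.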

Performance test results, generated using the tests in https://github.com/apache/iceberg-rust/pull/497:

Jul 31 '24 20:07 sdd