
Concurrent data file fetching and parallel RecordBatch processing

Open sdd opened this issue 1 year ago • 1 comments

This brings some big performance gains vs the previous sequential batch processing. On my 12-core Ryzen 9 5900X, I see all 12 cores hitting about 50% utilization.

In my perf testing branch, a full table scan retrieved 84 million rows in 7s, or over 11M rows/sec. Real-world throughput could be quite a bit higher, as 50% of the CPU usage was MinIO serving up the data files.

As with the concurrent file plan PR, the concurrency config has been set to fast defaults based on testing a range of values but can be user-configured.

Performance test results, generated using the tests in https://github.com/apache/iceberg-rust/pull/497:

[Image: performance test results]

sdd, Jul 31 '24 20:07