Improve Primary Key Index Creation and Persistence

Open • pdames opened this issue 2 years ago • 1 comment

Although this can work as an interim implementation, we'll want to quickly follow it with one of the following: (1) use something like AllReduce (or perhaps a different Actor-based design) to map source file records to destination file records; (2) amortize the cost of generating the compacted table's primary key index by only generating it periodically, and append source file pointers to the tail of the index between periodic runs; or (3) create the compacted table's primary key index from the in-memory files we are about to materialize, with a final step at the end of each round to shuffle it back into the appropriate hash buckets.
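To make option (3) a bit more concrete, here is a minimal sketch of emitting index entries while files are still in memory and then shuffling them into hash buckets at the end of a round. Everything in it is an assumption for illustration: the `IndexEntry` shape, the helper names, and the SHA-1-based bucketing are hypothetical and are not taken from deltacat.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical entry shape: (primary_key, materialized_file_index, row_index).
IndexEntry = Tuple[str, int, int]


def hash_bucket(primary_key: str, num_buckets: int) -> int:
    """Map a primary key to a stable bucket (illustrative hashing scheme)."""
    digest = hashlib.sha1(primary_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def index_entries_for_file(
    file_index: int, primary_keys: Iterable[str]
) -> List[IndexEntry]:
    """Emit one index entry per record while the file is still in memory."""
    return [(pk, file_index, row) for row, pk in enumerate(primary_keys)]


def shuffle_into_buckets(
    entries: Iterable[IndexEntry], num_buckets: int
) -> Dict[int, List[IndexEntry]]:
    """Final step of a round: group index entries by hash bucket."""
    buckets: Dict[int, List[IndexEntry]] = defaultdict(list)
    for entry in entries:
        buckets[hash_bucket(entry[0], num_buckets)].append(entry)
    return dict(buckets)
```

The point of the sketch is that the index is derived from the records being written, so it only ever points at compacted files and never at the original source files.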

There are probably other options here too, but the main point is this: building a primary key index off of source files will usually work, but only as long as all of the source files continue to exist forever, which isn't true for any catalog that enforces data retention policies. It can also present some pretty major performance issues for degenerate tables where we wind up either always reading too many small files or a few overly large files. The likelihood of this problem occurring or growing will also increase over time as we retain pointers to a larger number of source files.
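One way to picture how option (2) bounds the number of retained source file pointers is the rough sketch below. It assumes a hypothetical `build_index` callable and a fixed rebuild cadence; none of these names come from deltacat, and a real implementation would regenerate the index from the compacted table rather than from a plain list of file names.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AmortizedIndex:
    """Hypothetical index that is only regenerated every few rounds."""

    rebuild_every: int
    # Assumed callable: turns pending source file pointers into index entries.
    build_index: Callable[[List[str]], Dict[str, str]]
    index: Dict[str, str] = field(default_factory=dict)
    pending_files: List[str] = field(default_factory=list)
    rounds_since_rebuild: int = 0

    def finish_round(self, new_source_files: List[str]) -> None:
        # Between periodic runs, only append pointers to the tail.
        self.pending_files.extend(new_source_files)
        self.rounds_since_rebuild += 1
        if self.rounds_since_rebuild >= self.rebuild_every:
            # Periodic run: fold the tail into the index and drop the pointers.
            self.index.update(self.build_index(self.pending_files))
            self.pending_files.clear()
            self.rounds_since_rebuild = 0
```

With this shape, the tail of source file pointers never holds more than `rebuild_every` rounds of files, at the cost of a more expensive periodic run.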

Originally posted by @pdames in https://github.com/ray-project/deltacat/pull/56#discussion_r1095329453

Feb 03 '23 06:02 pdames

Building the primary key index after materialize is staged in this branch: https://github.com/ray-project/deltacat/tree/post_pki_build until correctness tests are completed. It has been manually tested at scale, however, and seems to improve overall hash bucket and dedupe time.

Feb 10 '23 21:02 raghumdani