Hash Index Rework

Open ray6080 opened this issue 1 year ago • 0 comments

There are several issues in terms of usability and performance with our hash index.

Performance-wise:

~~It doesn't scale to multiple threads.~~
~~It doesn't support dynamic rehashing, so the CSV reader is forced to count the exact number of tuples, which is often slow.~~
~~https://github.com/kuzudb/kuzu/issues/2626~~
More optimizations can be integrated to improve the performance of the index data structure, such as fingerprints, stashed buckets, balanced insertions, and displacement, as presented in Dash[^1].
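
To make the fingerprint idea concrete, here is a minimal sketch of a single bucket that stores a one-byte fingerprint next to each slot, so that most non-matching slots are rejected without a full key comparison. Everything in it is an assumption made for illustration: the `Bucket` layout, the capacity of 8, the FNV-1a hash, and the function names are hypothetical and are not taken from Kuzu's code or from the Dash implementation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical bucket layout for illustration only (not Kuzu's slot format).
struct Bucket {
    static constexpr std::size_t kCapacity = 8;
    std::array<std::uint8_t, kCapacity> fingerprints{}; // one byte per slot
    std::array<std::string, kCapacity> keys{};
    std::array<std::uint64_t, kCapacity> values{};
    std::size_t size = 0;
};

// 64-bit FNV-1a, chosen here only because it is short and deterministic.
inline std::uint64_t hashKey(const std::string& key) {
    std::uint64_t h = 14695981039346656037ULL;
    for (unsigned char c : key) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

// Take the top byte of the hash as the fingerprint, assuming the low bits
// would be used elsewhere to pick the bucket.
inline std::uint8_t fingerprintOf(std::uint64_t hash) {
    return static_cast<std::uint8_t>(hash >> 56);
}

// Returns a pointer to the stored value, or nullptr if the key is absent.
inline const std::uint64_t* lookup(const Bucket& bucket, const std::string& key) {
    const std::uint8_t fp = fingerprintOf(hashKey(key));
    for (std::size_t i = 0; i < bucket.size; ++i) {
        if (bucket.fingerprints[i] != fp) {
            continue; // cheap one-byte reject, no key comparison needed
        }
        if (bucket.keys[i] == key) { // full comparison only on fingerprint match
            return &bucket.values[i];
        }
    }
    return nullptr;
}

// Returns false if the key already exists or the bucket is full; a real index
// would split, stash, or displace entries instead of failing when full.
inline bool insert(Bucket& bucket, const std::string& key, std::uint64_t value) {
    assert(bucket.size <= Bucket::kCapacity);
    if (bucket.size == Bucket::kCapacity || lookup(bucket, key) != nullptr) {
        return false;
    }
    bucket.fingerprints[bucket.size] = fingerprintOf(hashKey(key));
    bucket.keys[bucket.size] = key;
    bucket.values[bucket.size] = value;
    ++bucket.size;
    return true;
}
```

The benefit is largest for string keys, where a full comparison may touch out-of-line data; with an 8-bit fingerprint, a slot holding a different key passes the fingerprint check only about 1 time in 256 if the hash bits are well distributed.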

Usability-wise:

[x] Parallel hash index.
[x] Support dynamic growing and remove counting from CSV reader.
[ ] Rework string layout to get rid of ku_string_t.
[x] Add fingerprint optimization.
[x] Rework to scale to various key data types. (#2728)
[ ] Move hash index building into its own physical operator.
[ ] Add support for CREATE INDEX and for altering a node table to define its primary key (defining the primary key when creating a node table would no longer be required, but the primary key must exist before a rel table is defined over that node table).
[ ] Merge hash indexes into a single file (to be debated: merge them directly into data.kz or keep a separate index.kz file).

Oct 27 '23 18:10 ray6080