Hash Index Rework
Problems
There are several issues in terms of usability and performance with our hash index.
Performance-wise:
- ~~It doesn't scale to multiple threads.~~
- ~~It doesn't support dynamic rehashing, so the CSV reader is forced to count the exact number of tuples, which is often slow.~~
- ~~https://github.com/kuzudb/kuzu/issues/2626~~
- More optimizations can be integrated to improve the performance of the index data structure, such as fingerprints, stashed buckets, balanced insertions, and displacement, as presented in Dash[^1].
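
To make the fingerprint idea concrete, here is a minimal in-memory sketch. It assumes a fixed-capacity slot that stores one byte of each key's hash next to the key, so a lookup can skip most non-matching entries without a full key comparison. All names (`Slot`, `FingerprintIndex`, `SLOT_CAPACITY`) are hypothetical and do not reflect the actual slot layout in Kuzu or in Dash.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical layout for illustration only; not Kuzu's on-disk slot format.
constexpr std::size_t SLOT_CAPACITY = 4;

struct Slot {
    std::array<uint8_t, SLOT_CAPACITY> fingerprints{};  // one hash byte per entry
    std::array<std::string, SLOT_CAPACITY> keys;
    std::array<uint64_t, SLOT_CAPACITY> values{};
    std::size_t numEntries = 0;
};

class FingerprintIndex {
public:
    // numSlots must be greater than zero.
    explicit FingerprintIndex(std::size_t numSlots) : slots(numSlots) {}

    // Returns false on a duplicate key or a full slot.
    // (A real index would split or use an overflow slot instead of failing.)
    bool insert(const std::string& key, uint64_t value) {
        if (lookup(key).has_value()) return false;
        const std::size_t hash = std::hash<std::string>{}(key);
        Slot& slot = slots[hash % slots.size()];
        if (slot.numEntries == SLOT_CAPACITY) return false;
        const std::size_t pos = slot.numEntries++;
        slot.fingerprints[pos] = fingerprintOf(hash);
        slot.keys[pos] = key;
        slot.values[pos] = value;
        return true;
    }

    std::optional<uint64_t> lookup(const std::string& key) const {
        const std::size_t hash = std::hash<std::string>{}(key);
        const Slot& slot = slots[hash % slots.size()];
        const uint8_t fp = fingerprintOf(hash);
        for (std::size_t i = 0; i < slot.numEntries; ++i) {
            // Cheap one-byte check first; compare full keys only on a match.
            if (slot.fingerprints[i] != fp) continue;
            if (slot.keys[i] == key) return slot.values[i];
        }
        return std::nullopt;
    }

private:
    // Use the top byte of the hash so the fingerprint is not simply
    // a function of the low bits that pick the slot.
    static uint8_t fingerprintOf(std::size_t hash) {
        return static_cast<uint8_t>(hash >> (sizeof(std::size_t) * 8 - 8));
    }

    std::vector<Slot> slots;
};
```

The benefit is largest for string keys, where a full comparison may have to follow a pointer to an overflow area; with a one-byte fingerprint, most mismatches are rejected without touching the key bytes at all.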
Usability-wise:
- ~~It wasn't coded in a way to scale to different data types for keys.~~ (#2728)
- ~~It limits string keys to at most 4KB.~~ (fixed in #2689)
- https://github.com/kuzudb/kuzu/issues/2625
- Need to support multi-copy.
TODOs
- [x] Parallel hash index.
- [x] Support dynamic growing and remove counting from CSV reader.
- [ ] Rework string layout to get rid of ku_string_t.
- [x] Add fingerprint optimization.
- [x] Rework to scale to various key data types. (#2728)
- [ ] Separate the hash index building into a separate physical operator.
- [ ] Add support for `CREATE INDEX`, and for altering a node table to define its primary key (defining the primary key when defining a node table is no longer required, but the primary key must exist when defining a rel table over it).
- [ ] Merge hash indexes into a single file. (To be debated: merge directly into `data.kz`, or keep an `index.kz` file.)
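
For context on the "dynamic growing" item above, the following is a minimal sketch of why a growable index removes the need to count tuples up front: the table rehashes itself once a load threshold is crossed, so the caller can insert rows as they are read. The class name, the chained-bucket layout, and the doubling policy are all assumptions made for illustration; they are not the scheme actually implemented in Kuzu.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical growable index; the doubling policy is an assumption.
// Duplicate keys are not checked, to keep the sketch short.
class GrowableIndex {
public:
    explicit GrowableIndex(std::size_t initialBuckets = 4) : buckets(initialBuckets) {}

    void insert(int64_t key, uint64_t value) {
        // Grow before inserting once the average chain length would exceed MAX_LOAD.
        if (numEntries + 1 > buckets.size() * MAX_LOAD) {
            rehash(buckets.size() * 2);
        }
        bucketFor(buckets, key).emplace_back(key, value);
        ++numEntries;
    }

    std::optional<uint64_t> lookup(int64_t key) const {
        const auto& bucket = buckets[std::hash<int64_t>{}(key) % buckets.size()];
        for (const auto& [k, v] : bucket) {
            if (k == key) return v;
        }
        return std::nullopt;
    }

    std::size_t numBuckets() const { return buckets.size(); }
    std::size_t size() const { return numEntries; }

private:
    using Bucket = std::vector<std::pair<int64_t, uint64_t>>;
    static constexpr std::size_t MAX_LOAD = 2;

    static Bucket& bucketFor(std::vector<Bucket>& table, int64_t key) {
        return table[std::hash<int64_t>{}(key) % table.size()];
    }

    // Redistribute every entry into a larger table.
    void rehash(std::size_t newNumBuckets) {
        std::vector<Bucket> newBuckets(newNumBuckets);
        for (auto& bucket : buckets) {
            for (auto& entry : bucket) {
                bucketFor(newBuckets, entry.first).push_back(entry);
            }
        }
        buckets = std::move(newBuckets);
    }

    std::vector<Bucket> buckets;
    std::size_t numEntries = 0;
};
```

A full rehash like this pauses inserts while every entry is moved; incremental schemes that split one bucket at a time avoid that pause, at the cost of more bookkeeping.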