kuzu icon indicating copy to clipboard operation
kuzu copied to clipboard

Hash Index Rework

Open ray6080 opened this issue 1 year ago • 0 comments

Problems

There are several issues in terms of usability and performance with our hash index.

Performance wise:

  1. ~~It doesn't scale to multiple threads.~~
  2. ~~It doesn't support rehash dynamically, thus, the CSV reader is forced to count exact num of tuples, which is often slow.~~
  3. ~~https://github.com/kuzudb/kuzu/issues/2626~~
  4. More optimizations can be integrated to improve the performance of the index data structure, like fingerprints, stashed buckets, balanced insertions, and displacement in presented in Dash[^1].

Usability wise:

  1. ~~It wasn't coded in a way to scale to different data types for keys.~~ (#2728)
  2. ~~It limits string keys to be equal or less than 4KB.~~ (fixed in #2689)
  3. https://github.com/kuzudb/kuzu/issues/2625
  4. Need to support multi-copy.

TODOs

  • [x] Parallel hash index.
  • [x] Support dynamic growing and remove counting from CSV reader.
  • [ ] Rework string layout to get rid of ku_string_t.
  • [x] Add fingerprint optimization.
  • [x] Rework to scale to various key data types. (#2728)
  • [ ] Separate the hash index building into a separate physical operator.
  • [ ] Add support of CREATE INDEX, and alter node table to define primary key (defining the primary key when defining a node table is no longer required, but require the primary key exists when defining a rel table over it).
  • [ ] Merge hash indexes into a single file. (to be debated whether directly merged into data.kz or keep a index.kz file).

[^1]: Dash: Scalable Hashing on Persistent Memory

ray6080 avatar Oct 27 '23 18:10 ray6080