
Optimization: Rel batch insert memory usage improvement

Open ray6080 opened this issue 8 months ago • 0 comments

Description

COPY into a rel table can consume a very large amount of memory on datasets with a large number of rels, for example https://www.cs.cornell.edu/~arb/data/colisten-Spotify/ and the dataset mentioned in #3660.

The memory usage probably comes mainly from the current design, which partitions and materializes all rel tuples before constructing node groups in parallel. More detailed memory profiling is still needed to find out whether some data structures use memory unnecessarily.

To reduce peak memory usage, we should make several changes to the rel batch insert pipeline:

  • Make ColumnChunk memory allocation go through the memory manager (MM), so that memory usage is bounded by the buffer manager (BM) (#3660).
  • Allow spilling to disk during partitioning, so that each partition can be read back sequentially when constructing node groups in parallel. (Note: when a node group is very large, in the extreme case when only one node group exists, we should optimize this further later with the segmenting design, which chunks a node group into more fine-grained storage units.)
  • Lightweight compression for in-memory materialized ColumnChunks, using either dynamically typed integer columns (i.e. switching between 8-, 16-, 32-, and 64-bit integer values) or byte-packed integer columns (8, 16, 24, 32, ... bits). The advantage over bitpacking is that the data would need to be re-packed less frequently, and switching between natively supported types instead of packing could have a performance benefit. Both options would make interfaces more complicated, since direct memory access would not generally work: unless we did the same thing with ValueVector, a memcpy directly from ValueVector to ColumnChunk would not be possible, because the types would have different sizes and we'd need to check the range of the new values.
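
To make the byte-packed option from the last bullet more concrete, here is a minimal sketch in Python. It is an illustration only, not Kuzu's ColumnChunk code: the class `BytePackedColumn` and all of its members are hypothetical, and it assumes unsigned values of at most 64 bits. It shows the two costs discussed above: every append needs a range check, and the column is re-packed only when a value needs more bytes than the current width.

```python
class BytePackedColumn:
    """Unsigned integer column stored with a fixed number of bytes per value.

    Hypothetical illustration only; this is not Kuzu's ColumnChunk API.
    """

    MAX_WIDTH = 8  # widest supported value: 64 bits

    def __init__(self):
        self.width = 1           # bytes per value, grows on demand
        self.data = bytearray()  # packed little-endian values
        self.repacks = 0         # how many times the column was widened

    def __len__(self):
        return len(self.data) // self.width

    @staticmethod
    def _required_width(value):
        # Range check: number of whole bytes needed to hold the value.
        if value < 0:
            raise ValueError("this sketch only handles unsigned values")
        return max(1, (value.bit_length() + 7) // 8)

    def _widen(self, new_width):
        # Re-pack every existing value at the new, larger width.
        repacked = bytearray()
        for i in range(len(self)):
            repacked += self[i].to_bytes(new_width, "little")
        self.data = repacked
        self.width = new_width
        self.repacks += 1

    def append(self, value):
        needed = self._required_width(value)
        if needed > self.MAX_WIDTH:
            raise OverflowError("value does not fit in 64 bits")
        if needed > self.width:
            self._widen(needed)
        self.data += value.to_bytes(self.width, "little")

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        start = i * self.width
        return int.from_bytes(self.data[start:start + self.width], "little")


col = BytePackedColumn()
for v in [3, 200, 70000, 5]:
    col.append(v)
# 70000 needs 3 bytes, so the column was widened once from 1 to 3 bytes
# per value: 4 values occupy 12 bytes instead of 32 as plain 64-bit integers.
print(col.width, col.repacks, len(col.data))
```

Because the width changes in whole bytes, re-packing happens at most seven times over the life of a column in this sketch, whereas a bitpacked layout may have to re-pack on any increase in bit width. The per-append range check is the reason a plain memcpy from a fixed-width source would no longer work.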

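The spill-to-disk bullet can also be sketched briefly. The Python below is a hypothetical illustration under simplifying assumptions, not Kuzu's implementation: each rel is just a `(src, dst)` pair of unsigned 64-bit offsets, a partition is identified by `src // node_group_size`, and the names `SpillingPartitioner`, `node_group_size`, and `max_buffered` are invented for the example. Buffered rels are appended to a per-partition file once a buffer fills up, so each partition can later be streamed back sequentially.

```python
import os
import struct
import tempfile

EDGE = struct.Struct("<QQ")  # one rel as (src offset, dst offset), 16 bytes


class SpillingPartitioner:
    """Buffers rels per partition and spills full buffers to disk.

    Hypothetical sketch; names and sizes are assumptions, not Kuzu's.
    """

    def __init__(self, spill_dir, node_group_size=4, max_buffered=2):
        self.spill_dir = spill_dir
        self.node_group_size = node_group_size  # src offsets per partition
        self.max_buffered = max_buffered        # rels kept in memory per partition
        self.buffers = {}                       # partition id -> list of rels

    def _path(self, pid):
        return os.path.join(self.spill_dir, f"part_{pid}.bin")

    def _spill(self, pid):
        # Append the buffered rels to the partition's file, then free them.
        with open(self._path(pid), "ab") as f:
            for src, dst in self.buffers[pid]:
                f.write(EDGE.pack(src, dst))
        self.buffers[pid] = []

    def add(self, src, dst):
        pid = src // self.node_group_size
        buf = self.buffers.setdefault(pid, [])
        buf.append((src, dst))
        if len(buf) >= self.max_buffered:
            self._spill(pid)

    def finish(self):
        # Flush whatever is still buffered and return the partition ids.
        for pid in list(self.buffers):
            if self.buffers[pid]:
                self._spill(pid)
        return sorted(self.buffers)

    def read_partition(self, pid):
        # Stream one partition back sequentially, one rel at a time.
        with open(self._path(pid), "rb") as f:
            while chunk := f.read(EDGE.size):
                yield EDGE.unpack(chunk)


with tempfile.TemporaryDirectory() as spill_dir:
    partitioner = SpillingPartitioner(spill_dir)
    for src, dst in [(0, 5), (1, 6), (4, 0), (2, 7), (5, 1)]:
        partitioner.add(src, dst)
    for pid in partitioner.finish():
        print(pid, list(partitioner.read_partition(pid)))
```

In this sketch, peak memory is bounded by the number of partitions times `max_buffered`, not by the total number of rels. It does not address the single very large node group case from the note above, since one partition would still have to be read back in full.
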
ray6080 · Jun 24 '24 17:06