Support High-Frequency Deletes & Updates with HNSW
tl;dr
The current delete cleanup logic uses only a single thread and tries to clean up all tombstones in a single run. This can lead to a very long-running process. Instead, the workload should be chunked up and run in parallel.
Description
Currently, a single goroutine lists all tombstones and cleans them up in a single run. This creates two problems:
- The process can run for very long if there are a lot of tombstones. This is especially problematic because if a crash occurs while this is running we might be left in an undesirable state. Only the very last step would mark a node as finally deleted in the commit logger. If a crash occurs before, it would have to be cleaned up again.
- The process cannot be parallelized
Suggested alternative
- Rather than cleaning up all tombstones in a single run, we should clean up at most
ntombstones in a single run (e.g.n = 10000). This way a single cycle can return (and write into the commit log) relatively quickly. - Rather than spawning just one of the goroutines outlined in (1), we should spawn multiple
mgoroutines. (e.g.m = NumCPUs / 2)
For my understanding, besides the obvious accumulation of garbage, what happens if we don't cleanup all tombstones before the server starts? i.e. can they be dealt with in the background by some kind of simple, single threaded garbage collection logic?
Rather than spawning just one of the goroutines outlined in (1), we should spawn multiple m goroutines.
All of the inner functions seem tightly locked so I wonder how far we can go with concurrency, but that's the ideal solution.
Changed the title to widen the scope a bit. It is no longer about a specific way to increase the supported frequency of deletes but rather about increasing it – with any means necessary.
Completed in https://github.com/weaviate/weaviate/pull/4003.