Refactor garbage collection with the new janitor service coming

Open fmassot opened this issue 3 years ago • 5 comments

Current situation

Garbage collection is run on an indexer for each pipeline. The garbage collection of splits is intertwined with the IndexingSplitStore. This store keeps splits locally for merge purposes, but currently the removal of these splits is done by the garbage collector.

Issues

This is bad for several reasons:

  • garbage collection should run per index, as mentioned in #1572.
  • this does not work if we have several indexers.
  • IndexingSplitStore should delete local splits as soon as splits are published and mature and should not rely on the garbage collector.

Targeted solution

We want to run garbage collection on a janitor service (#1774, #1773). This implies separating the garbage collection logic from the IndexingSplitStore logic.

TODO:

  • Implement a garbage collection logic at the indexer level for the IndexingSplitStore so that we don't keep useless splits locally.
  • Refactor GC to operate on multiple indexes and move to janitor service.
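
The first TODO item can be illustrated with a minimal sketch of a local cache that evicts splits once they are both published and mature, without waiting for the garbage collector. All names here (`LocalSplitCache`, `LocalSplit`, `SplitState`, `evict_published_mature`) are hypothetical and are not Quickwit's actual types; the real IndexingSplitStore would also have to delete files on disk.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified lifecycle state of a split.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SplitState {
    Staged,
    Published,
}

/// Hypothetical metadata tracked for each split kept locally.
#[derive(Debug, Clone)]
pub struct LocalSplit {
    pub state: SplitState,
    /// A mature split is assumed to no longer be a merge candidate.
    pub mature: bool,
}

/// Hypothetical in-memory stand-in for the local part of a split store.
#[derive(Default)]
pub struct LocalSplitCache {
    splits: HashMap<String, LocalSplit>,
}

impl LocalSplitCache {
    pub fn insert(&mut self, split_id: &str, split: LocalSplit) {
        self.splits.insert(split_id.to_string(), split);
    }

    pub fn len(&self) -> usize {
        self.splits.len()
    }

    /// Removes every split that is both published and mature,
    /// and returns the evicted split ids in sorted order.
    pub fn evict_published_mature(&mut self) -> Vec<String> {
        let mut evicted: Vec<String> = self
            .splits
            .iter()
            .filter(|(_, split)| split.state == SplitState::Published && split.mature)
            .map(|(split_id, _)| split_id.clone())
            .collect();
        evicted.sort();
        for split_id in &evicted {
            self.splits.remove(split_id);
        }
        evicted
    }
}
```

Calling `evict_published_mature` right after a publish event would keep the local directory free of splits that will never be merged again, which is the behavior the issue asks for.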

IndexingSplitStore refactoring should resolve https://github.com/quickwit-oss/quickwit/issues/855

fmassot avatar Jul 27 '22 15:07 fmassot

On refactoring the SplitStore

@guilload @fmassot We want a global SplitStore per indexing node. Currently, we have a split-store per source. We need to decide on the indexing directory layout. My idea is to create a top-level SplitStore that manages the SplitStore of every index. That way we get rid of the per-source split-store separation, and all indexing pipelines for a given index work in the same cache and scratch directory.

  • Is there any constraint/benefit from what we are doing currently by separating split-store per source?
  • Is it always ok to merge splits from different sources (source_id)?

qwdata/indexing
├── index_a
│   ├── cache
│   │   └── splits
│   └── scratch
└── index_b
    ├── cache
    │   └── splits
    └── scratch

evanxg852000 avatar Sep 14 '22 11:09 evanxg852000

Let's not merge splits from different sources. After the refactor, the merge key should be (node_id, index_id, source_id).
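
To make the proposed merge key concrete, here is a minimal sketch that groups splits by (node_id, index_id, source_id), so that only splits sharing the full key end up as merge candidates together. The `MergeKey` and `SplitMeta` types and the `group_by_merge_key` function are hypothetical illustrations, not Quickwit's actual API.

```rust
use std::collections::BTreeMap;

/// Hypothetical merge key: splits are only merged with splits sharing all three ids.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct MergeKey {
    pub node_id: String,
    pub index_id: String,
    pub source_id: String,
}

/// Hypothetical, trimmed-down split metadata.
#[derive(Debug, Clone)]
pub struct SplitMeta {
    pub split_id: String,
    pub node_id: String,
    pub index_id: String,
    pub source_id: String,
}

impl SplitMeta {
    pub fn merge_key(&self) -> MergeKey {
        MergeKey {
            node_id: self.node_id.clone(),
            index_id: self.index_id.clone(),
            source_id: self.source_id.clone(),
        }
    }
}

/// Buckets split ids by merge key, preserving input order inside each bucket.
pub fn group_by_merge_key(splits: &[SplitMeta]) -> BTreeMap<MergeKey, Vec<String>> {
    let mut groups: BTreeMap<MergeKey, Vec<String>> = BTreeMap::new();
    for split in splits {
        groups
            .entry(split.merge_key())
            .or_default()
            .push(split.split_id.clone());
    }
    groups
}
```

With this grouping, two splits of the same index but from different sources land in different buckets and are never merged together.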

guilload avatar Sep 14 '22 13:09 guilload

> Is it always ok to merge splits from different sources (source_id)?

I'm not sure about that right now, but this question seems unrelated to the fundamental problem we want to solve: having a global split cache.

> Is there any constraint/benefit from what we are doing currently by separating split-store per source?

The benefit is isolation between indexing pipelines. It's always nice to have different working directories, and it's also convenient for debugging purposes.

My personal opinion on this: let's focus on the problem of having a global split cache directory so that we can control its size. To do that, we can simply instantiate one LocalSplitStore and give it to each indexing pipeline. It should work as is. We could also improve how splits are organized in the cache directory, but I see that as an improvement.

fmassot avatar Sep 14 '22 14:09 fmassot

Actually, my solution does not work because of the split cleanup we do when we start the server. I need to rethink that.

fmassot avatar Sep 14 '22 14:09 fmassot

> Currently, we have a split-store per source.

This is inaccurate. Currently, there is one split store per indexing pipeline.

The layout you describe is also inaccurate. The current layout is <data dir>/indexing/<index ID>/<source ID>/{cache,scratch}. As @fmassot commented, this layout serves us well for isolating indexing pipelines and debugging, so let's preserve it.
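
As an illustration of that layout, the sketch below derives the cache and scratch directories of a pipeline from the data directory, index ID, and source ID. The `PipelineDirs` struct and `pipeline_dirs` function are hypothetical names; only the path shape `<data dir>/indexing/<index ID>/<source ID>/{cache,scratch}` comes from the comment above.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical pair of working directories for one (index, source).
#[derive(Debug, PartialEq, Eq)]
pub struct PipelineDirs {
    pub cache: PathBuf,
    pub scratch: PathBuf,
}

/// Builds `<data dir>/indexing/<index ID>/<source ID>/{cache,scratch}`.
/// This only computes paths; it does not create anything on disk.
pub fn pipeline_dirs(data_dir: &Path, index_id: &str, source_id: &str) -> PipelineDirs {
    let root = data_dir.join("indexing").join(index_id).join(source_id);
    PipelineDirs {
        cache: root.join("cache"),
        scratch: root.join("scratch"),
    }
}
```

Because each (index ID, source ID) pair gets its own root, inspecting or wiping the working state of a single pipeline during debugging only touches one subtree.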

The way things are working currently:

  • one split store per indexing pipeline
  • stale splits in the split store are GCed on startup
  • one merge pipeline per indexing pipeline
  • indexing pipeline and merge pipeline share split store
  • merge pipeline evicts merge splits from split store

We want to enable cross indexing pipeline merges. So we need:

  • one split store per pair (index_id, source_id), i.e. one split store shared across the indexing pipelines working on the same source.
  • one merge pipeline per (index_id, source_id)
  • indexing pipeline and merge pipeline share split store (same as before)
  • merge pipeline evicts merge splits from split store (same as before)
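
The target arrangement can be sketched as a registry that hands out one shared split store per (index_id, source_id) pair, so every indexing pipeline on the same source, and the merge pipeline for that source, see the same store. `SplitStore` and `SplitStoreRegistry` below are hypothetical stand-ins, assuming shared ownership through `Arc`; they are not Quickwit's actual types.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical split store that only tracks split ids in memory.
#[derive(Default)]
pub struct SplitStore {
    split_ids: Mutex<Vec<String>>,
}

impl SplitStore {
    pub fn add_split(&self, split_id: &str) {
        self.split_ids.lock().unwrap().push(split_id.to_string());
    }

    pub fn num_splits(&self) -> usize {
        self.split_ids.lock().unwrap().len()
    }
}

/// Hypothetical registry: one store per (index_id, source_id).
#[derive(Default)]
pub struct SplitStoreRegistry {
    stores: HashMap<(String, String), Arc<SplitStore>>,
}

impl SplitStoreRegistry {
    /// Returns the store for this pair, creating it on first use.
    /// Repeated calls with the same pair return handles to the same store.
    pub fn get_or_create(&mut self, index_id: &str, source_id: &str) -> Arc<SplitStore> {
        let key = (index_id.to_string(), source_id.to_string());
        self.stores.entry(key).or_default().clone()
    }
}
```

Two indexing pipelines on the same source would each call `get_or_create` with the same pair and receive clones of the same `Arc`, while a pipeline on a different source gets an independent store.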

This does not solve the problem of controlling the global size of the split stores, but let's tackle that in another issue/PR.

guilload avatar Sep 14 '22 14:09 guilload

Closed via #2178.

guilload avatar Oct 31 '22 23:10 guilload