quickwit
decommissioning ingest_v2 hangs
When trying to quit Quickwit, it occasionally hangs at "decommissioning ingester". Only kill -9 works.
^C2024-06-03T09:35:15.935Z INFO quickwit_ingest::ingest_v2::ingester: decommissioning ingester
^C
I couldn't reproduce it; it seems to happen randomly.
If there's data inflight, have you waited long enough for the next commit? It can take up to commit_timeout_secs + ε.
In some cases I didn't ingest any data, so there shouldn't be any data inflight. I also didn't enable ingest v2 via QW_ENABLE_INGEST_V2.
Here is the bug.
Executing the REST API tests against a running node that does not have ingest V2 enabled still creates a shard and drops a few records in it. The shard does not get indexed and is not cleaned up. The next time you try to decommission the node without v2 enabled, we wait for the shard to be drained, which will never happen.
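To make the failure mode concrete, here is a minimal sketch of a drain wait with a deadline. It is not Quickwit's code and not necessarily the fix that was adopted; `DrainOutcome`, `wait_for_drain`, and the `count_undrained` callback are hypothetical names used only for illustration. Without the deadline check, a shard that never gets indexed keeps the loop spinning forever, which matches the hang described above.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical outcome of a decommission attempt (not a Quickwit type).
#[derive(Debug, PartialEq, Eq)]
pub enum DrainOutcome {
    /// Every shard was fully indexed before the deadline.
    Drained,
    /// The deadline passed while some shards still held unindexed records.
    TimedOut { undrained_shards: usize },
}

/// Polls `count_undrained` until it reports 0 or `timeout` elapses.
/// `count_undrained` is an assumed callback returning the number of shards
/// that still hold unindexed records.
pub fn wait_for_drain<F>(
    mut count_undrained: F,
    poll_interval: Duration,
    timeout: Duration,
) -> DrainOutcome
where
    F: FnMut() -> usize,
{
    let deadline = Instant::now() + timeout;
    loop {
        let undrained_shards = count_undrained();
        if undrained_shards == 0 {
            return DrainOutcome::Drained;
        }
        let now = Instant::now();
        if now >= deadline {
            // An orphaned shard never drains, so give up instead of hanging.
            return DrainOutcome::TimedOut { undrained_shards };
        }
        // Never sleep past the deadline.
        sleep(poll_interval.min(deadline - now));
    }
}
```

A caller could log the `TimedOut` case and proceed with shutdown rather than requiring kill -9.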
I don't think this is the only issue. See analysis in https://github.com/quickwit-oss/quickwit/pull/5283.
TL;DR, I would say there are 2 other issues:
- when shutting down the control plane before it has a chance to schedule the ingest pipeline of a new indexer node, that node will hang forever during shutdown because its shard never gets indexed
- when shutting down all nodes of a cluster at once, the indexer tries to commit one last empty batch (I would assume to notify that the shard is closed), but it fails indefinitely because the metastore and control plane are no longer there
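The second issue comes down to an unbounded retry. As a rough sketch only, assuming the final commit can be modeled as a fallible operation, a bounded retry looks like the following. `retry_bounded` and its parameters are hypothetical and do not correspond to any Quickwit API.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Calls `op` until it succeeds or `max_attempts` calls have been made,
/// sleeping `backoff` between attempts. Returns the last error on failure.
/// Hypothetical helper for illustration, not part of Quickwit.
pub fn retry_bounded<T, E, F>(mut op: F, max_attempts: usize, backoff: Duration) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
{
    assert!(max_attempts > 0, "max_attempts must be at least 1");
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the error instead of retrying forever.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                attempt += 1;
                sleep(backoff);
            }
        }
    }
}
```

With a bound in place, a node whose metastore is already gone would return an error from the last commit and could continue shutting down.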
In https://github.com/quickwit-oss/quickwit/pull/5283 I added 2 integration tests covering the two issues mentioned above:
ingest_tests::test_shutdown_metastore_first
Both pass on ingest V1 and fail if we enable ingest V2.
Some extra docs: https://github.com/quickwit-oss/quickwit/pull/5418