
decommissioning ingest_v2 hangs

Open PSeitz opened this issue 1 year ago • 6 comments

When trying to quit Quickwit, it occasionally hangs at "decommissioning ingester". Only kill -9 stops the process.

^C2024-06-03T09:35:15.935Z  INFO quickwit_ingest::ingest_v2::ingester: decommissioning ingester
^C

I couldn't reproduce it reliably; it seems to happen randomly.

PSeitz avatar Jun 03 '24 09:06 PSeitz

If there's data inflight, have you waited long enough for the next commit? It can take up to commit_timeout_secs + ε.

guilload avatar Jun 03 '24 15:06 guilload

In some cases I didn't ingest any data, so there shouldn't be any data inflight. I also didn't enable ingest V2 via QW_ENABLE_INGEST_V2.

PSeitz avatar Jun 04 '24 23:06 PSeitz

Here is the bug.

Executing the REST API tests against a running node that does not have ingest V2 enabled still creates a shard and drops a few records in it. The shard does not get indexed and is not cleaned up. The next time you try to decommission the node without v2 enabled, we wait for the shard to be drained, which will never happen.
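
To illustrate why that wait never ends, here is a minimal model of the situation. It is not Quickwit's actual code: `Shard`, `decommission`, and `DecommissionOutcome` are hypothetical names, and the timeout exists only so that the sketch terminates instead of hanging like the real node does.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical, simplified view of a shard held by an ingester.
#[derive(Debug, Clone)]
struct Shard {
    /// Position of the last record written to the shard.
    last_written: u64,
    /// Position up to which records have been indexed and published.
    last_indexed: u64,
}

impl Shard {
    /// A shard is drained once everything written to it has been indexed.
    fn is_drained(&self) -> bool {
        self.last_indexed >= self.last_written
    }
}

#[derive(Debug, PartialEq)]
enum DecommissionOutcome {
    Decommissioned,
    TimedOut { undrained_shards: usize },
}

/// Polls the shards until all of them are drained or `timeout` elapses.
/// Without the deadline check, an orphan shard keeps this loop spinning forever.
fn decommission<F>(
    mut poll_shards: F,
    timeout: Duration,
    poll_interval: Duration,
) -> DecommissionOutcome
where
    F: FnMut() -> Vec<Shard>,
{
    let deadline = Instant::now() + timeout;
    loop {
        let undrained_shards = poll_shards()
            .iter()
            .filter(|shard| !shard.is_drained())
            .count();
        if undrained_shards == 0 {
            return DecommissionOutcome::Decommissioned;
        }
        if Instant::now() >= deadline {
            return DecommissionOutcome::TimedOut { undrained_shards };
        }
        thread::sleep(poll_interval);
    }
}
```

A shard with a few records and no indexing pipeline attached keeps `last_indexed` behind `last_written`, so `is_drained` stays false and only the deadline ends the loop. A node with no shards at all returns `Decommissioned` on the first poll.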

guilload avatar Jun 07 '24 14:06 guilload

> Executing the REST API tests against a running node that does not have ingest V2 enabled still creates a shard and drops a few records in it. The shard does not get indexed and is not cleaned up. The next time you try to decommission the node without v2 enabled, we wait for the shard to be drained, which will never happen.

I don't think this is the only issue. See analysis in https://github.com/quickwit-oss/quickwit/pull/5283.

TL;DR, I would say there are 2 other issues:

  • when shutting down the control plane before it has a chance to schedule the ingest pipeline of a new indexer node, that node will hang forever during shutdown because its shard never gets indexed
  • when shutting down all nodes of a cluster at once, the indexer tries to commit one last empty batch (I would assume to notify that the shard is closed), but it fails indefinitely because the metastore/cp are not there anymore
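
The second case comes down to a retry loop with no exit condition. The sketch below contrasts that with a bounded retry. It is only an illustration, not the change made in the PR: `commit_with_retries` and `MetastoreUnavailable` are hypothetical names, and the `try_commit` closure stands in for whatever call publishes the last batch.

```rust
/// Hypothetical error returned when the metastore cannot be reached.
#[derive(Debug, PartialEq)]
struct MetastoreUnavailable;

/// Tries to commit at most `max_attempts` times.
/// Returns the number of the attempt that succeeded, or an error once the
/// budget is exhausted. An unbounded version of this loop is what hangs
/// when the metastore has already shut down.
fn commit_with_retries<F>(
    mut try_commit: F,
    max_attempts: u32,
) -> Result<u32, MetastoreUnavailable>
where
    F: FnMut() -> Result<(), MetastoreUnavailable>,
{
    for attempt in 1..=max_attempts {
        if try_commit().is_ok() {
            return Ok(attempt);
        }
    }
    Err(MetastoreUnavailable)
}
```

With a metastore that never comes back, the closure fails on every call and the function gives up after `max_attempts` tries, which lets the shutdown proceed instead of blocking on the final commit.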

rdettai avatar Aug 02 '24 08:08 rdettai

In https://github.com/quickwit-oss/quickwit/pull/5283 I added 2 integration tests covering the two issues mentioned above:

  • ingest_tests::test_shutdown_metastore_first

Both pass on ingest V1 and fail if we enable ingest V2.

rdettai avatar Aug 02 '24 13:08 rdettai

Some extra docs: https://github.com/quickwit-oss/quickwit/pull/5418

rdettai avatar Jan 06 '25 09:01 rdettai