Control Plane not exiting on panic

Open UN0wen opened this issue 6 months ago • 12 comments

Describe the bug We're deploying Quickwit in cluster mode through Helm (chart version 7.15, Quickwit version 0.8.2). We noticed today that some new pods were spinning up but not connecting to the existing cluster. Instead, they only peered among themselves and crashed because no metastore was available to them.

Upon inspection, we found this traceback in the control plane logs (the pod is healthy)

thread 'tokio-runtime-worker' panicked at /usr/local/cargo/git/checkouts/chitchat-22cf90d3696646d6/d039699/chitchat/src/delta.rs:413:9:
assertion failed: mtu >= 100
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

It seems like after encountering this error, the control plane does not exit and continues to run, seemingly doing nothing (?).

Expected behavior Control plane should exit with an error if an internal component panics.
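For context, a panic on a worker thread in Rust only unwinds that thread by default, and a panic inside a spawned tokio task is caught by the runtime and surfaced through the task's JoinHandle, so the process can stay alive and look healthy. Below is a minimal sketch of two ways to surface such a panic, using plain std threads as a stand-in for the runtime. The names `run_worker` and `install_exit_on_panic_hook` are hypothetical and are not part of Quickwit or chitchat.

```rust
use std::panic;
use std::process;
use std::thread;

/// Hypothetical helper: runs `work` on its own thread and reports whether
/// it panicked. Returns Ok(value) on success, Err(message) on panic.
fn run_worker<T, F>(work: F) -> Result<T, String>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    match thread::spawn(work).join() {
        Ok(value) => Ok(value),
        Err(payload) => {
            // Panic payloads are usually a &str or a String.
            let message = if let Some(s) = payload.downcast_ref::<&str>() {
                s.to_string()
            } else if let Some(s) = payload.downcast_ref::<String>() {
                s.clone()
            } else {
                "unknown panic payload".to_string()
            };
            Err(message)
        }
    }
}

/// Hypothetical helper: after calling this, a panic on any thread prints
/// the usual panic message and then terminates the whole process with a
/// non-zero exit code, so an orchestrator such as Kubernetes can restart it.
fn install_exit_on_panic_hook() {
    let default_hook = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        default_hook(info); // keep the standard "thread ... panicked at" output
        process::exit(1);
    }));
}
```

A caller could invoke `install_exit_on_panic_hook()` once at startup, or check the result of `run_worker` and shut down when it is `Err`. Setting `panic = "abort"` in the Cargo release profile is another option with a similar effect. Which of these, if any, fits Quickwit's architecture is for the maintainers to decide.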

Configuration: Output of quickwit --version:

quickwit version: 0.8.2 (aarch64-unknown-linux-gnu 2024-06-17T16:36:47Z 42766b8)

UN0wen avatar Jun 05 '25 19:06 UN0wen

This error can happen if you have too many nodes in your cluster. How many nodes are present?

fulmicoton avatar Jun 06 '25 09:06 fulmicoton

We also experienced the same issue with the indexer. We have 12 indexers, 1 control plane, 1 janitor, 3 metastores, and 10 searchers. Restarting the indexer in this state didn't help; it only restarted properly after we restarted the metastore.

earlbread avatar Jun 06 '25 15:06 earlbread

We only have 4 indexers, 1 janitor, 1 metastore, 3 searchers, and 1 control plane. However, this is a k8s environment where node autoscaling and consolidation happen fairly frequently. Could that have contributed?

We were able to resolve this with a control plane restart. Restarting other nodes only widened the fracture between the two clusters.

UN0wen avatar Jun 06 '25 23:06 UN0wen

@earlbread and I are working together, so I attached additional info.

  • Quickwit version: 0.8.2
  • Components: 12 indexers, 1 control plane, 1 janitor, 3 metastores, 10 searchers
  • I am not sure, but there were likely hundreds of zombie nodes in metastore at that time

So we first restarted metastore, and then sequentially restarted indexer, searcher and janitor to resolve the issue.

rachel-mj-park avatar Jun 09 '25 01:06 rachel-mj-park

About the root cause: Zombie nodes in the chitchat state are a known issue. It was solved a few months ago.

About the panic issue: too bad you don't have a stack trace 😢.

If you are adventurous, until we cut a release (long overdue, I know), you could try using the tag qw-airmail-20250522. It includes all recent changes and is pretty stable.

rdettai avatar Jun 10 '25 08:06 rdettai

Thank you for the response. If I upgrade from version 0.8.2 to qw-airmail-20250522, are there any breaking changes?

earlbread avatar Jun 10 '25 15:06 earlbread

It seems like after encountering this error, the control plane does not exit and continues to run, seemingly doing nothing (?).

This is a very valid point. Let's fix this.

likely hundreds of zombie nodes in metastore at that time

That is the root cause of the problem.

fulmicoton avatar Jun 11 '25 08:06 fulmicoton

If I upgrade from version 0.8.2 to qw-airmail-20250522, are there any breaking changes?

The upgrade instructions https://quickwit.io/docs/main-branch/operating/upgrades#migration-from-08-to-09 would apply.

rdettai avatar Jun 12 '25 11:06 rdettai

@rdettai Thank you!

earlbread avatar Jun 12 '25 12:06 earlbread

@rdettai Hi, I see that there are two branches: qw-airmail-20250522 and qw-airmail-20250522-hotfix.

It looks like this commit has been added: Fix Jemalloc not used in regular build

Should I use the qw-airmail-20250522-hotfix branch to update?

earlbread avatar Jun 16 '25 09:06 earlbread

After upgrading from 0.8.2 to qw-airmail-20250522-hotfix, we encountered the following error:

internal error: `failed to deserialize `alloc::vec::Vec<quickwit_metastore::metastore::index_metadata::IndexMetadata>` from JSON: EOF while parsing a value at line 1 column 0`

Due to this error, we attempted to roll back to version 0.8.2, but then encountered the following issue:

run_migrations: quickwit_metastore::metastore::postgres::migrator: failed to run PostgreSQL migrations ...

We resolved the issue by removing all components and restarting everything using the qw-airmail-20250522-hotfix image, bringing up the components in the following order: Control Plane, Meta Store, Indexer, Searcher, and Janitor.

Is there a recommended restart order or an official upgrade guide for version upgrades like this?

earlbread avatar Jun 17 '25 08:06 earlbread

We resolved the issue by removing all components and restarting everything using the qw-airmail-20250522-hotfix image, bringing up the components in the following order: Control Plane, Meta Store, Indexer, Searcher, and Janitor.

I thought that scaling down the indexers first, as mentioned in the migration guide above, was enough. I don't see any change to the IndexMetadata model that would be breaking... I would need to try to reproduce this.

rdettai avatar Jun 19 '25 12:06 rdettai

@earlbread @rachel-mj-park We’re tired of dealing with this issue. After updating to qw-airmail-20250522, have you noticed any serious problems? Thank you

silentsokolov avatar Oct 16 '25 15:10 silentsokolov

@silentsokolov Since upgrading to qw-airmail-20250522-hotfix, I haven’t encountered this issue or any other critical problems yet.

earlbread avatar Oct 16 '25 16:10 earlbread