Control Plane not exiting on panic
Describe the bug: We're deploying Quickwit in cluster mode through Helm (chart version 7.15, Quickwit version 0.8.2). We noticed today that some new pods spin up but do not connect to the existing cluster. Instead, they peer only among themselves and crash because no metastore is available to them.
Upon inspection, we found this traceback in the control plane logs (the pod is still reported as healthy):
thread 'tokio-runtime-worker' panicked at /usr/local/cargo/git/checkouts/chitchat-22cf90d3696646d6/d039699/chitchat/src/delta.rs:413:9:
assertion failed: mtu >= 100
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
It seems like after encountering this error, the control plane does not exit and continues to run, seemingly doing nothing (?).
Expected behavior: The control plane should exit with an error if an internal component panics.
Configuration: output of `quickwit --version`:
quickwit version: 0.8.2 (aarch64-unknown-linux-gnu 2024-06-17T16:36:47Z 42766b8)
This error can happen if you have too many nodes in your cluster. How many nodes are present?
We also experienced the same issue with the indexers. We have 12 indexers, 1 control plane, 1 janitor, 3 metastores, and 10 searchers. Restarting the indexers didn't help in this state; they only restarted properly after we restarted the metastore.
We only have 4 indexers, 1 janitor, 1 metastore, 3 searchers, and 1 control plane. However, this is a k8s environment where node autoscaling and consolidation happen semi-frequently. Could that have contributed?
We were able to resolve this with a control plane restart; restarting other nodes only widened the fracture between the two clusters.
@earlbread and I are working together, so I attached additional info.
- Quickwit version: 0.8.2
- Components: 12 indexers, 1 control plane, 1 janitor, 3 metastores, 10 searchers
- I am not sure, but there were likely hundreds of zombie nodes in metastore at that time
So we first restarted the metastore, and then sequentially restarted the indexers, searchers, and janitor to resolve the issue.
About the root cause: Zombie nodes in the chitchat state are a known issue. It was solved a few months ago.
About the panic issue: too bad you don't have a stack trace 😢 .
If you are adventurous, until we cut a release (long overdue, I know), you could try using the tag qw-airmail-20250522. It includes all recent changes and is pretty stable.
Thank you for the response.
If I upgrade from version 0.8.2 to qw-airmail-20250522, are there any breaking changes?
> It seems like after encountering this error, the control plane does not exit and continues to run, seemingly doing nothing (?).
This is a very valid point. Let's fix this.
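For illustration only, here is a minimal sketch of one way a Rust service can react to a panic on a background thread. It is not Quickwit's actual fix. A Tokio runtime catches panics in spawned tasks and reports them through the task's `JoinHandle`, so the process keeps running unless something acts on the panic. The sketch uses plain `std` threads to stay dependency-free, and the names `install_panic_hook` and `worker_panic_triggers_hook` are hypothetical.

```rust
use std::panic;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Installs a process-wide panic hook that first runs the previously
/// installed hook (so the panic message is still printed) and then calls
/// `on_panic`. Hypothetical helper, not part of Quickwit or chitchat.
fn install_panic_hook<F>(on_panic: F)
where
    F: Fn() + Send + Sync + 'static,
{
    let previous_hook = panic::take_hook();
    panic::set_hook(Box::new(move |panic_info| {
        previous_hook(panic_info);
        on_panic();
    }));
}

/// Demonstrates the hook with a recording callback instead of an exit:
/// a panic on a worker thread fires the hook, while the spawning thread
/// keeps running and only sees an `Err` from `join`.
fn worker_panic_triggers_hook() -> bool {
    let fired = Arc::new(AtomicBool::new(false));
    let fired_in_hook = Arc::clone(&fired);
    install_panic_hook(move || fired_in_hook.store(true, Ordering::SeqCst));

    let worker = thread::spawn(|| {
        panic!("simulated worker panic");
    });
    // The panic is confined to the worker thread.
    let worker_panicked = worker.join().is_err();
    worker_panicked && fired.load(Ordering::SeqCst)
}
```

In a real service, the callback would more likely be `|| std::process::exit(1)`, installed at the top of `main` before the runtime starts, so that an orchestrator such as Kubernetes sees the failure and restarts the pod. Setting `panic = "abort"` in the Cargo release profile is another option, at the cost of skipping unwinding entirely.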
> likely hundreds of zombie nodes in metastore at that time
That is the root cause of the problem.
> If I upgrade from version 0.8.2 to qw-airmail-20250522, are there any breaking changes?
The upgrade instructions at https://quickwit.io/docs/main-branch/operating/upgrades#migration-from-08-to-09 would apply.
@rdettai Thank you!
@rdettai Hi, I see that there are two branches: qw-airmail-20250522 and qw-airmail-20250522-hotfix.
It looks like this commit has been added: Fix Jemalloc not used in regular build
Should I use the qw-airmail-20250522-hotfix branch to update?
After upgrading from 0.8.2 to qw-airmail-20250522-hotfix, we encountered the following error:
internal error: `failed to deserialize `alloc::vec::Vec<quickwit_metastore::metastore::index_metadata::IndexMetadata>` from JSON: EOF while parsing a value at line 1 column 0`
Due to this error, we attempted to roll back to version 0.8.2, but then encountered the following issue:
run_migrations: quickwit_metastore::metastore::postgres::migrator: failed to run PostgreSQL migrations ...
We resolved the issue by removing all components and restarting everything using the qw-airmail-20250522-hotfix image, bringing up the components in the following order: Control Plane, Meta Store, Indexer, Searcher, and Janitor.
Is there a recommended restart order or an official upgrade guide for version upgrades like this?
> We resolved the issue by removing all components and restarting everything using the qw-airmail-20250522-hotfix image, bringing up the components in the following order: Control Plane, Meta Store, Indexer, Searcher, and Janitor.
I thought scaling down the indexers first, as mentioned in the migration guide above, was enough. I don't see any change to the IndexMetadata model that would be breaking... I would need to try to reproduce this.
@earlbread @rachel-mj-park We’re tired of dealing with this issue. After updating to qw-airmail-20250522, have you noticed any serious problems? Thank you
@silentsokolov Since upgrading to qw-airmail-20250522-hotfix, I haven’t encountered this issue or any other critical problems yet.