Chaos Mesh pod-kill-one-compute fails occasionally even though the cluster seems to recover successfully
Describe the bug
https://buildkite.com/risingwave-test/longevity-chaos-mesh/builds/354#018c6cd1-1d73-460b-b33f-2a11377228a6
The pipeline's code is at https://github.com/risingwavelabs/risingwave-test/pull/526. cc @xuefengze if there are any questions.
We notice that in this pipeline we executed 12 tests for each Nexmark query. (The query itself is largely irrelevant; we just ran different chaos experiments multiple times to increase coverage.)
Since the first 11 tests all passed before the 12th one, we believe the testing pipeline itself is sound and gives RW enough room to recover successfully after the chaos is injected.
We have 4 experiments here:
- pod-kill all the compute node pods
- pod-kill one compute node pod
- pod-failure all the compute node pods
- pod-failure one compute node pod
The failed one is pod-kill-one-compute. This experiment succeeded three times in the same pipeline linked above.
The description below is all about this particular failed example.
We can infer that the killed pod compute-2 was restarted at 19:18:14 - 30s = 19:17:44. The restart counts of the other two compute node pods are irrelevant, as they were caused by previous experiments in the same pipeline.
This is further confirmed by the log from the meta node: https://grafana.test.risingwave-cloud.xyz/d/liz0yRCZz1/log-search-dashboard?from=1702638720000&orgId=1&to=1702639140000&var-data_source=PE59595AED52CF917&var-namespace=longcmkf-20231215-093208&var-pod=benchmark-risingwave-meta-m-0&var-search=
(this is UTC time)
After roughly 10 seconds, the meta node decides that the cluster has recovered successfully. (I suppose 10 seconds is a reasonable recovery time, considering that restarting a CN pod alone takes a few seconds.)
After recovery, we run a set of DDL and DML statements to test whether the cluster has recovered and runs as expected. They include create table/MV, insert, and select from table/MV.
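For context, here is a minimal sketch of what such a post-recovery sanity check can look like. The table/MV names are hypothetical and the actual statements live in the risingwave-test pipeline; it assumes the frontend is reachable via psycopg2 on the default port 4566.

```python
# Hypothetical post-recovery sanity check (not the actual risingwave-test code).
# Assumes a RisingWave frontend at localhost:4566 with the default user/database.
import psycopg2

conn = psycopg2.connect(host="localhost", port=4566, user="root", dbname="dev")
conn.autocommit = True
cur = conn.cursor()

# DDL: create a table and a materialized view on top of it.
cur.execute("CREATE TABLE chaos_check (id INT, v INT)")
cur.execute("CREATE MATERIALIZED VIEW chaos_check_mv AS SELECT SUM(v) AS s FROM chaos_check")

# DML: insert a few rows and flush so they become visible to batch queries.
cur.execute("INSERT INTO chaos_check VALUES (1, 10), (2, 20)")
cur.execute("FLUSH")

# Queries: select from both the table and the MV and check the results.
cur.execute("SELECT COUNT(*) FROM chaos_check")
assert cur.fetchone()[0] == 2
cur.execute("SELECT s FROM chaos_check_mv")
assert cur.fetchone()[0] == 30
```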
As we can see from the figure above, some statements failed.
This is also confirmed by the log from the frontend node: https://grafana.test.risingwave-cloud.xyz/d/liz0yRCZz1/log-search-dashboard?from=1702639200000&orgId=1&to=1702639380000&var-data_source=PE59595AED52CF917&var-namespace=longcmkf-20231215-093208&var-pod=benchmark-risingwave-frontend-f-7986879684-d92ws&var-search=
However, this is weird:
- During the same period when these SQL statements were issued, no error message is found in the meta node or any of the three compute nodes.
- Not all of the SQL statements failed; only some of them did.
Additional Context
We remark that a long time ago we had an issue where the meta node could not detect the failure/kill of compute node pods until 15 minutes later. That issue is gone now (details: https://github.com/risingwavelabs/risingwave-test/issues/440), and its behavior is different from what we observed here.
The version of RisingWave: nightly-20231213
I found that all the failed cases are about frontend and compute node interactions, such as queries and DML. It's possible that after recovery there is some problem with the network connection between the frontend and the compute nodes; the errors are all transport error and deadline has elapsed. Interestingly, this issue didn't show up in the madsim recovery test.
this Issue didn't show up in the madsim recovery test.
Do you mean that the network connection issue is more likely caused by the environment setup rather than by RW itself?
Not sure. Oh, I found one test behaviour that is different in madsim's recovery test: it retries several times when it encounters an error. I'm guessing it might have something to do with the fact that the frontend uses a connection pool for compute nodes: when compute nodes get restarted and the cluster recovers, a query or DML on the frontend that uses an old connection from the pool might report an error at that point. Meta also uses a connection pool for compute nodes, but it doesn't have this issue because it retries until all compute nodes are online during recovery, so all of meta's connections get refreshed. So how about adding some retry logic in the chaos mesh test as well? Since we don't have any auto-retry logic in the frontend yet, I think it's necessary, and this way we can determine whether it's the root cause of this issue.
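For illustration, a minimal sketch of the kind of client-side retry such a test could add (a hypothetical helper, not the actual risingwave-test code; it assumes a psycopg2 client and uses the two error messages quoted above to decide whether to retry):

```python
# Hypothetical retry wrapper for the chaos-mesh test client.
# Assumes recovery-time failures surface as psycopg2.Error with messages
# containing "transport error" or "deadline has elapsed".
import time
import psycopg2

RETRYABLE = ("transport error", "deadline has elapsed")

def execute_with_retry(dsn, sql, attempts=5, backoff_s=2.0):
    """Run one statement, retrying when a recovery-time error is reported."""
    last_err = None
    for attempt in range(attempts):
        conn = None
        try:
            # Reconnect on each attempt to keep the sketch simple; the key
            # point is retrying the statement after the frontend has had time
            # to refresh its compute-node connections.
            conn = psycopg2.connect(dsn)
            conn.autocommit = True
            with conn.cursor() as cur:
                cur.execute(sql)
                return cur.fetchall() if cur.description else None
        except psycopg2.Error as e:
            last_err = e
            if not any(msg in str(e) for msg in RETRYABLE):
                raise  # unrelated failure: surface it immediately
            time.sleep(backoff_s * (attempt + 1))
        finally:
            if conn is not None:
                conn.close()
    raise last_err
```

Only the two error messages observed here are retried, so any other failure would still fail the test.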
So how about adding some retry logic in the chaos mesh test as well? Since we don't have any auto-retry logic in the frontend yet, I think it's necessary, and this way we can determine whether it's the root cause of this issue.
Sure. cc @xuefengze, could you help update the results here after the test with the retry logic is done?
After adding the retry logic, all tests passed. Some SQL commands that initially failed were successful after the retry.
https://buildkite.com/risingwave-test/longevity-chaos-mesh/builds/361
One thing I am confused about: we also kill CNs frequently in prod clusters (e.g., the case from yesterday's meeting), and the select queries issued after restarting never seem to run into such errors.
It seems random? What kind of setting could make this reproduce more frequently?
If it requires an unusual setting, I suppose this is a low-priority issue.
We can push a ton of data through PostgreSQL CDC very suddenly (say, 100M rows changed at once, which results in updates to a 1T-row materialized view). This will often cause the streaming node to crash. Usually the cluster does not recover after such a crash, and we have to rebuild it from scratch.
Hi @rotten, is there any error message/panic in the log? Or is the crash caused by out-of-memory?
My memory utilization graphs do not show it running out of memory. I don't get any panic or message in the log. I will do more testing today - as soon as I get a chance - to try to reproduce this in a way that I can capture more information about what is happening. I'll let you know.
I started by setting the size of my compute-streaming docker container to 3G and then setting --total-memory-bytes to 2000000000, which I believe is 2G. Then I kicked off an initial load. It did OOM, but I got a lot of errors from docker compose when it went down.
Caused by: java.lang.OutOfMemoryError: Java heap space
...
thread 'rw-streaming' panicked at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/sync/mpsc/chan.rs:278:13:
assertion failed: self.inner.semaphore.is_idle()
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
...
(core dumped)
I'll try the same thing in a cloud deploy next with RW_BACKTRACE=1. I'm not 100% convinced it is an OOM that is crashing the container, but it is also pretty easy to cause OOMs. I have observed that the compute nodes will try to stay below the --total-memory-bytes setting but sometimes shoot over it, so I always configure it to 2/3 to 3/4 of the available memory in the container. If it shoots over too fast and too far, the Linux OOM killer will take it down.
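As a rough illustration of that sizing rule of thumb (the 2/3 to 3/4 fraction is my own heuristic, not an official recommendation):

```python
# Rule of thumb: leave headroom below the container limit so a temporary
# overshoot of --total-memory-bytes doesn't trigger the Linux OOM killer.
GIB = 1024 ** 3

def total_memory_bytes(container_limit_bytes: int, fraction: float = 0.7) -> int:
    """Pick a --total-memory-bytes value as a fraction of the container limit."""
    return int(container_limit_bytes * fraction)

# A 3 GiB container -> ~2.25e9 bytes, in line with the 2000000000 used above.
print(total_memory_bytes(3 * GIB))
```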
The cloud logging could be suppressing the OOM error messages if the container goes down before they can be emitted. I'll let you know shortly.
Oh, and the java process that OOM'd was the debezium postgres connector.
My first attempt yesterday at reproducing the OOM or process crash in the cloud failed - even with a restricted streaming container, it loaded all the data (it is possible the test environment I used has much less data than my laptop). I'll keep poking at it and let you know if I learn anything beyond what I've already found.
This issue has been open for 60 days with no activity.
If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.
You can also confidently close this issue as not planned to keep our backlog clean. Don't worry if you think the issue is still valuable to continue in the future. It's searchable and can be reopened when it's time. 😄