
Chaos Mesh pod-kill-one-compute fails occasionally even though the cluster seems to recover successfully

Open lmatz opened this issue 1 year ago • 11 comments

Describe the bug

https://buildkite.com/risingwave-test/longevity-chaos-mesh/builds/354#018c6cd1-1d73-460b-b33f-2a11377228a6

The pipeline's code is at: https://github.com/risingwavelabs/risingwave-test/pull/526 cc: @xuefengze if there are any questions.

We note that in this pipeline, we executed 12 tests for each Nexmark query. (The particular query is largely irrelevant; we just ran different chaos experiments multiple times to cover more cases.)

Since the first 11 tests all pass before the 12th one fails, we suppose the testing pipeline itself leaves RW enough room to recover successfully after the chaos is injected.

We have 4 experiments here:

  1. pod-kill all the compute node pods
  2. pod-kill one compute node pod (see the sketch after this list)
  3. pod-failure all the compute node pods
  4. pod-failure one compute node pod
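
For reference, a minimal sketch of what experiment 2 could look like when submitted programmatically with the Kubernetes Python client is shown below. The namespace is taken from the failed run discussed here, but the label selector for the compute pods and the use of Python (rather than the YAML manifest the pipeline most likely applies) are assumptions for illustration only.

```python
# Minimal sketch: submit a Chaos Mesh PodChaos experiment that kills one
# compute-node pod. The label selector is hypothetical; match whatever labels
# your compute pods actually carry. The real pipeline most likely applies an
# equivalent YAML manifest instead of going through the Python client.
from kubernetes import client, config

NAMESPACE = "longcmkf-20231215-093208"  # namespace of the failed run above

pod_chaos = {
    "apiVersion": "chaos-mesh.org/v1alpha1",
    "kind": "PodChaos",
    "metadata": {"name": "pod-kill-one-compute", "namespace": NAMESPACE},
    "spec": {
        "action": "pod-kill",  # experiments 3/4 use "pod-failure"
        "mode": "one",         # experiments 1/3 use "all"
        "selector": {
            "namespaces": [NAMESPACE],
            # hypothetical label for RisingWave compute-node pods
            "labelSelectors": {"risingwave/component": "compute"},
        },
    },
}

config.load_kube_config()  # or load_incluster_config() when run inside the cluster
client.CustomObjectsApi().create_namespaced_custom_object(
    group="chaos-mesh.org",
    version="v1alpha1",
    namespace=NAMESPACE,
    plural="podchaos",
    body=pod_chaos,
)
```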

The failing experiment is pod-kill-one-compute. The same experiment succeeded three times earlier in the pipeline linked above (screenshot: SCR-20231218-id4).

The description below is all about this particular failed example.

(Screenshot: SCR-20231218-igq) We can infer that the killed pod compute-2 was restarted at 19:18:14 - 30s = 19:17:44. The restart counts of the other two compute node pods are irrelevant, as they come from previous experiments in the same pipeline.

This is further confirmed by the log from the meta node: https://grafana.test.risingwave-cloud.xyz/d/liz0yRCZz1/log-search-dashboard?from=1702638720000&orgId=1&to=1702639140000&var-data_source=PE59595AED52CF917&var-namespace=longcmkf-20231215-093208&var-pod=benchmark-risingwave-meta-m-0&var-search=

(Screenshot: SCR-20231218-ini; the timestamps are in UTC.)

After roughly 10 seconds, the meta node decides that the cluster has recovered successfully (screenshot: SCR-20231218-iod). I suppose 10 seconds is a reasonable recovery time, considering that restarting the CN pod alone already takes a few seconds.

After recovery, we run a set of DDL and DML statements to check whether the cluster has actually recovered and behaves as expected (screenshot: SCR-20231218-iuc).

They include CREATE TABLE/MATERIALIZED VIEW, INSERT, and SELECT from the table/MV. As the figure above shows, some of the statements failed.
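
The exact statements live in the risingwave-test pipeline; purely as an illustration of the kind of check described above, a sketch with made-up table/MV names and a hypothetical frontend service name might look like this, connecting over the frontend's Postgres-compatible port:

```python
# Rough illustration of the post-recovery DDL/DML check described above.
# The host/service name and the table/MV names are made up; the real
# statements live in the risingwave-test pipeline.
import psycopg2

conn = psycopg2.connect(
    host="benchmark-risingwave-frontend",  # hypothetical frontend service name
    port=4566,                             # RisingWave's Postgres-compatible port
    user="root",
    dbname="dev",
)
conn.autocommit = True

smoke_statements = [
    "CREATE TABLE chaos_check (id INT, v INT)",
    "CREATE MATERIALIZED VIEW chaos_check_mv AS "
    "SELECT id, SUM(v) AS total FROM chaos_check GROUP BY id",
    "INSERT INTO chaos_check VALUES (1, 10), (2, 20)",
    "FLUSH",
    "SELECT * FROM chaos_check",
    "SELECT * FROM chaos_check_mv",
    "DROP MATERIALIZED VIEW chaos_check_mv",
    "DROP TABLE chaos_check",
]

with conn.cursor() as cur:
    for sql in smoke_statements:
        cur.execute(sql)  # any error here marks the experiment as failed
```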

This is also confirmed by the log from the frontend node: https://grafana.test.risingwave-cloud.xyz/d/liz0yRCZz1/log-search-dashboard?from=1702639200000&orgId=1&to=1702639380000&var-data_source=PE59595AED52CF917&var-namespace=longcmkf-20231215-093208&var-pod=benchmark-risingwave-frontend-f-7986879684-d92ws&var-search=

SCR-20231218-ivz

However, this is weird:

  1. During the same period when these SQL statements were issued, no error message is found in the logs of the meta node or of the three compute nodes.
  2. Not all of the SQL statements fail; only some of them do.

Additional Context

We remark that a long time ago, we once had an issue where the meta node could not detect the failure/kill of the compute node pods until 15 minutes later. That issue is gone now; details can be found here: https://github.com/risingwavelabs/risingwave-test/issues/440

And its behavior is different from what we observed here.

The version of RisingWave

nightly-20231213

lmatz avatar Dec 18 '23 05:12 lmatz

I found that all the failed cases involve frontend and compute node interactions, such as queries and DML. It's possible that after recovery there is some problem with the network connection between the frontend and the compute nodes: the errors are all transport error and deadline has elapsed. Interestingly, this issue didn't show up in the madsim recovery test.

yezizp2012 avatar Dec 18 '23 06:12 yezizp2012

this issue didn't show up in the madsim recovery test.

Do you mean that the network connection issue is more likely caused by the environment setup rather than by RW itself?

lmatz avatar Dec 18 '23 09:12 lmatz

this issue didn't show up in the madsim recovery test.

Do you mean that the network connection issue is more likely caused by the environment setup rather than by RW itself?

Not sure. But I found that one test behaviour is different in madsim's recovery test: it retries several times when it encounters an error. I'm guessing it might have something to do with the fact that the frontend uses a connection pool for compute nodes. When compute nodes get restarted and the cluster recovers, a query or DML on the frontend that picks up an old connection from the pool might report an error at that point. Meta also uses a connection pool for compute nodes, but it doesn't have this issue because it retries until all compute nodes are online during recovery, so all of its connections get refreshed then. So how about adding some retry logic in the chaos mesh test as well? Since we don't have any auto-retry logic in the frontend yet, I think it's necessary, and this way we can determine whether it's the root cause of this issue.
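
To make the suggestion concrete, a minimal sketch of such test-side retry logic could look like the following; the idea is simply to rerun a statement when the frontend surfaces a transient error right after recovery. Matching on the error-message text and the retry/backoff parameters are assumptions here, not necessarily what risingwave-test ended up implementing.

```python
# Sketch of the retry idea: rerun a statement when the frontend surfaces a
# transient error (e.g. "transport error" / "deadline has elapsed") right
# after recovery. The error-message matching and the retry/backoff values
# are assumptions.
import time

import psycopg2

TRANSIENT_MARKERS = ("transport error", "deadline has elapsed")

def execute_with_retry(conn, sql, attempts=5, backoff_s=2.0):
    for attempt in range(1, attempts + 1):
        try:
            with conn.cursor() as cur:
                cur.execute(sql)
            return
        except psycopg2.Error as exc:
            message = str(exc).lower()
            if attempt == attempts or not any(m in message for m in TRANSIENT_MARKERS):
                raise
            time.sleep(backoff_s)  # give the frontend's pooled connections time to refresh
```

Wrapping the post-recovery DDL/DML checks in something like this means a one-off transport error from a stale pooled connection no longer fails the whole experiment, which is what the later comments confirm.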

yezizp2012 avatar Dec 18 '23 09:12 yezizp2012

So how about adding some retry logic in the chaos mesh test as well? Since we don't have any auto-retry logic in the frontend yet, I think it's necessary, and this way we can determine whether it's the root cause of this issue.

Sure. cc: @xuefengze, could you help update the results here after the test with retry logic is done?

lmatz avatar Dec 19 '23 02:12 lmatz

After adding retry logic, all tests passed. Some SQL commands that initially failed were successful after the retry.


https://buildkite.com/risingwave-test/longevity-chaos-mesh/builds/361

xuefengze avatar Dec 19 '23 09:12 xuefengze

One thing that confuses me is that we also kill CNs frequently in production clusters, for example the case in yesterday's meeting, and the select queries issued after restarting never seem to run into such an error.

It seems random? What kind of setting would make this reproduce more frequently?

If it requires an unusual setting, I suppose this is a low-priority issue.

lmatz avatar Dec 21 '23 03:12 lmatz

We can push a ton of data through PostgreSQL CDC very suddenly (say, 100M rows changed at once, which results in updates to a 1T-row materialized view). This will often cause the streaming node to crash. Usually the cluster does not recover after such a crash, and we have to rebuild it from scratch.

rotten avatar Feb 26 '24 20:02 rotten

Hi @rotten

Is there any error message/panic in the log? Or is the crash caused by out-of-memory?

lmatz avatar Feb 27 '24 02:02 lmatz

My memory utilization graphs do not show it running out of memory. I don't get any panic or message in the log. I will do more testing today - as soon as I get a chance - to try to reproduce this in a way that I can capture more information about what is happening. I'll let you know.

rotten avatar Feb 27 '24 16:02 rotten

I started by setting the size of my compute-streaming docker container to 3G and setting --total-memory-bytes to 2000000000, which is 2 GB (decimal), about 1.86 GiB. Then I kicked off an initial load. It did OOM, and I got a lot of errors from docker compose when it went down.

Caused by: java.lang.OutOfMemoryError: Java heap space
...
thread 'rw-streaming' panicked at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/sync/mpsc/chan.rs:278:13:
assertion failed: self.inner.semaphore.is_idle()
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'rw-streaming' panicked at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/sync/mpsc/chan.rs:278:13:
assertion failed: self.inner.semaphore.is_idle()
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
...
(core dumped)

I'll try the same thing in a cloud deploy next with RUST_BACKTRACE=1. I'm not 100% convinced it is an OOM that is crashing the container, but it is also pretty easy to cause OOMs. I have observed that the compute nodes try to stay below the --total-memory-bytes setting but sometimes shoot over it, so I always configure it to 2/3 to 3/4 of the available memory in the container. If it shoots over too fast and too far, the Linux OOM killer will take it down.
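
Just to make that 2/3 to 3/4 rule of thumb concrete, here is a hypothetical helper (the function name and the default fraction are made up):

```python
# Hypothetical helper for the rule of thumb above: hand --total-memory-bytes
# roughly 2/3 to 3/4 of the container's memory limit, so that brief spikes
# above the target do not trip the Linux OOM killer.
def total_memory_bytes(container_limit_bytes: int, fraction: float = 0.7) -> int:
    assert 2 / 3 <= fraction <= 3 / 4, "stay within the 2/3..3/4 band"
    return int(container_limit_bytes * fraction)

container_limit = 3 * 1024**3               # 3 GiB compute-streaming container
print(total_memory_bytes(container_limit))  # ~2.25 GiB, passed via --total-memory-bytes
```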

The cloud logging could be suppressing the OOM error messages if the container goes down before they can be emitted. I'll let you know shortly.

rotten avatar Feb 27 '24 20:02 rotten

Oh, and the Java process that OOM'd was the Debezium Postgres connector.

rotten avatar Feb 27 '24 20:02 rotten

My first attempt yesterday at reproducing the OOM or process crash in the cloud failed: even with a restricted streaming container, it loaded all the data (it is possible the test environment I used has much less data than my laptop). I'll keep poking at it and let you know if I learn anything beyond what I've already found.

rotten avatar Feb 28 '24 14:02 rotten

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean. Don't worry if you think the issue is still valuable to continue in the future. It's searchable and can be reopened when it's time. 😄

github-actions[bot] avatar Aug 01 '24 02:08 github-actions[bot]