risingwave icon indicating copy to clipboard operation
risingwave copied to clipboard

bug: stuck in recovery simulation test

Open hzxa21 opened this issue 1 year ago • 4 comments

Describe the bug

In the CI run of #15262 (cherry-pick #15260), the recovery simulation test is stuck when running batch e2e with MADSIM_TEST_SEED=5.

Because #15260 can pass all CI tests and can be successfully while the release-1.7 cherry-pick PR #15262 is stuck in recovery simulation test, after a discussion with @wangrunji0408 , we believe that the PR triggers a bug that is not related to the PR changes themselves.

Error message/log

Logs can be found in artifacts in https://buildkite.com/risingwavelabs/pull-request/builds/42771#018de4d6-1fcb-4023-a97e-63007ab1720c

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

git commit: 14531cd37c8e906bc620338d6c4238bb05391b12

Additional context

No response

hzxa21 avatar Feb 27 '24 09:02 hzxa21

Note that re-run doesn't help and the recovery simulation test is always stuck when running with MADSIM_TEST_SEED=5.

hzxa21 avatar Feb 27 '24 09:02 hzxa21

Previously fixed issue: https://github.com/risingwavelabs/risingwave/issues/14104.

The fix should be already in 1.7. But this bug could be due to similar reasons.

kwannoel avatar Feb 27 '24 09:02 kwannoel

Previously fixed issue: #14104.

The fix should be already in 1.7. But this bug could be due to similar reasons.

The log indeed stops, indicating that the cluster is stuck or very slow. I cannot reproduce it locally though.

hzxa21 avatar Feb 27 '24 09:02 hzxa21

We can use this branch for repro: https://github.com/risingwavelabs/risingwave/commits/patrick/recovery-stuck-repro/

hzxa21 avatar Feb 27 '24 10:02 hzxa21

It's a deadlock somewhere in DROP again. Here all all the tail logs:

2022-04-05T06:39:21.652236Z DEBUG risingwave_simulation::slt: Running cmd=Drop
2022-04-05T06:39:21.652236Z DEBUG risingwave_simulation::slt: retry count iteration=0
2022-04-05T06:39:21.726798Z  INFO risingwave_simulation::cluster: kill compute-2
2022-04-05T06:39:21.726801Z  INFO risingwave_meta::barrier::recovery: recovery start!
2022-04-05T06:39:21.726801Z DEBUG risingwave_meta::manager::catalog: creating_tables ids: []
2022-04-05T06:39:21.734000Z DEBUG risingwave_meta::barrier::recovery: clean dirty table fragments: {}
2022-04-05T06:39:21.748186Z  INFO risingwave_meta::barrier::recovery: recovering mview progress
2022-04-05T06:39:21.762740Z  INFO risingwave_meta::barrier::recovery: recovered mview progress
2022-04-05T06:39:21.892556Z  INFO risingwave_simulation::cluster: kill compute-1
2022-04-05T06:39:22.043266Z  INFO risingwave_simulation::cluster: kill meta-1
2022-04-05T06:39:22.088142Z  INFO risingwave_simulation::cluster: kill compute-3
2022-04-05T06:39:22.103008Z  INFO risingwave_simulation::cluster: restart compute-3
2022-04-05T06:39:22.120371Z  INFO risingwave_simulation::cluster: restart meta-1
2022-04-05T06:39:22.185730Z  INFO risingwave_simulation::cluster: kill compactor-1
2022-04-05T06:39:22.299107Z  INFO risingwave_simulation::cluster: kill compactor-2
2022-04-05T06:39:22.384536Z  INFO risingwave_simulation::cluster: restart compute-1
2022-04-05T06:39:22.452952Z  INFO risingwave_simulation::cluster: restart compactor-2
2022-04-05T06:39:23.051853Z  INFO risingwave_simulation::cluster: restart compactor-1
2022-04-05T06:39:42.109102Z  INFO risingwave_simulation::cluster: restart compute-2

I gave it 15 mins more to execute here: https://github.com/risingwavelabs/risingwave/pull/15512/files.

Notice the DROP command:

2022-04-05T06:39:21.652236Z DEBUG risingwave_simulation::slt: Running cmd=Drop

Something is blocking it from completion.

kwannoel avatar Mar 07 '24 12:03 kwannoel

Another issue that could be related: https://github.com/risingwavelabs/risingwave/issues/14910

kwannoel avatar Mar 07 '24 13:03 kwannoel

Screenshot 2024-03-26 at 11 26 02 AM

Root cause is deadlock between barrier recovery and drop stream job.

kwannoel avatar Mar 26 '24 04:03 kwannoel

Screenshot 2024-03-26 at 11 26 02 AM

Root cause is deadlock between barrier recovery and drop stream job.

Great finding! Just curious: how do you debug the issue? Did you use any deadlock detection tool?

hzxa21 avatar Mar 26 '24 05:03 hzxa21

Great finding! Just curious: how do you debug the issue? Did you use any deadlock detection tool?

Nope I just logged it, since I suspect deadlock due to drop 🤣 Any tools to share, perhaps we should add some to our toolbox.

kwannoel avatar Mar 26 '24 05:03 kwannoel