bug: stuck in recovery simulation test
Describe the bug
In the CI run of #15262 (cherry-pick of #15260), the recovery simulation test is stuck when running batch e2e with `MADSIM_TEST_SEED=5`.
Because #15260 can pass all CI tests successfully while the release-1.7 cherry-pick PR #15262 is stuck in the recovery simulation test, after a discussion with @wangrunji0408 we believe that the PR triggers a bug that is not related to the PR changes themselves.
Error message/log
Logs can be found in the artifacts of https://buildkite.com/risingwavelabs/pull-request/builds/42771#018de4d6-1fcb-4023-a97e-63007ab1720c
To Reproduce
No response
Expected behavior
No response
How did you deploy RisingWave?
No response
The version of RisingWave
git commit: 14531cd37c8e906bc620338d6c4238bb05391b12
Additional context
No response
Note that re-running doesn't help: the recovery simulation test is always stuck when running with `MADSIM_TEST_SEED=5`.
Previously fixed issue: https://github.com/risingwavelabs/risingwave/issues/14104. The fix should already be in 1.7, but this bug could be due to similar reasons.
The log indeed stops, indicating that the cluster is stuck or very slow. I cannot reproduce it locally though.
We can use this branch for repro: https://github.com/risingwavelabs/risingwave/commits/patrick/recovery-stuck-repro/
It's a deadlock somewhere in `DROP` again. Here are all the tail logs:
```
2022-04-05T06:39:21.652236Z DEBUG risingwave_simulation::slt: Running cmd=Drop
2022-04-05T06:39:21.652236Z DEBUG risingwave_simulation::slt: retry count iteration=0
2022-04-05T06:39:21.726798Z INFO risingwave_simulation::cluster: kill compute-2
2022-04-05T06:39:21.726801Z INFO risingwave_meta::barrier::recovery: recovery start!
2022-04-05T06:39:21.726801Z DEBUG risingwave_meta::manager::catalog: creating_tables ids: []
2022-04-05T06:39:21.734000Z DEBUG risingwave_meta::barrier::recovery: clean dirty table fragments: {}
2022-04-05T06:39:21.748186Z INFO risingwave_meta::barrier::recovery: recovering mview progress
2022-04-05T06:39:21.762740Z INFO risingwave_meta::barrier::recovery: recovered mview progress
2022-04-05T06:39:21.892556Z INFO risingwave_simulation::cluster: kill compute-1
2022-04-05T06:39:22.043266Z INFO risingwave_simulation::cluster: kill meta-1
2022-04-05T06:39:22.088142Z INFO risingwave_simulation::cluster: kill compute-3
2022-04-05T06:39:22.103008Z INFO risingwave_simulation::cluster: restart compute-3
2022-04-05T06:39:22.120371Z INFO risingwave_simulation::cluster: restart meta-1
2022-04-05T06:39:22.185730Z INFO risingwave_simulation::cluster: kill compactor-1
2022-04-05T06:39:22.299107Z INFO risingwave_simulation::cluster: kill compactor-2
2022-04-05T06:39:22.384536Z INFO risingwave_simulation::cluster: restart compute-1
2022-04-05T06:39:22.452952Z INFO risingwave_simulation::cluster: restart compactor-2
2022-04-05T06:39:23.051853Z INFO risingwave_simulation::cluster: restart compactor-1
2022-04-05T06:39:42.109102Z INFO risingwave_simulation::cluster: restart compute-2
```
I gave it 15 mins more to execute here: https://github.com/risingwavelabs/risingwave/pull/15512/files.
Notice the `DROP` command:
```
2022-04-05T06:39:21.652236Z DEBUG risingwave_simulation::slt: Running cmd=Drop
```
Something is blocking it from completion.
Another issue that could be related: https://github.com/risingwavelabs/risingwave/issues/14910
Root cause is deadlock between barrier recovery and drop stream job.
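To illustrate the shape of the problem, here is a minimal, self-contained sketch of that kind of lock-ordering deadlock between two async tasks. The lock names and ordering are purely illustrative (they only mimic the "barrier recovery" vs. "drop stream job" paths), not the actual meta-node code:

```rust
// Purely illustrative: two tasks each hold one lock and wait for the other's,
// so neither ever finishes. These are NOT RisingWave's actual data structures.
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Mutex;

#[tokio::main]
async fn main() {
    let recovery_lock = Arc::new(Mutex::new(()));
    let catalog_lock = Arc::new(Mutex::new(()));

    // Task A ("barrier recovery"): holds the recovery lock, then wants the catalog lock.
    let a = {
        let (r, c) = (recovery_lock.clone(), catalog_lock.clone());
        tokio::spawn(async move {
            let _r = r.lock().await;
            tokio::time::sleep(Duration::from_millis(10)).await;
            let _c = c.lock().await; // blocks forever once task B holds the catalog lock
        })
    };

    // Task B ("drop stream job"): holds the catalog lock, then wants the recovery lock.
    let b = {
        let (r, c) = (recovery_lock.clone(), catalog_lock.clone());
        tokio::spawn(async move {
            let _c = c.lock().await;
            tokio::time::sleep(Duration::from_millis(10)).await;
            let _r = r.lock().await; // blocks forever once task A holds the recovery lock
        })
    };

    // Neither task can make progress, so from the outside everything just looks
    // "stuck", exactly like the tail of the log above.
    let stuck = tokio::time::timeout(Duration::from_secs(1), async {
        let _ = tokio::join!(a, b);
    })
    .await
    .is_err();
    println!("deadlocked: {stuck}");
}
```

Because madsim replays the same interleaving for a given seed, once such a cycle is hit with `MADSIM_TEST_SEED=5` it reproduces on every re-run, which matches the behavior reported above.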
Great finding! Just curious: how do you debug the issue? Did you use any deadlock detection tool?
Nope, I just logged it, since I suspected a deadlock due to drop 🤣 Any tools to share? Perhaps we should add some to our toolbox.
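For the toolbox: one lightweight option (an illustrative sketch, not something the codebase uses today) is parking_lot's `deadlock_detection` feature, which lets a background thread poll for lock cycles. Caveat: it only sees parking_lot's sync locks, not tokio's async mutexes, so it would not necessarily catch a purely async deadlock like this one.

```rust
// Sketch only: requires `parking_lot = { version = "0.12", features = ["deadlock_detection"] }`.
// A background thread periodically asks parking_lot whether any of its locks
// form a wait cycle, and dumps the involved threads' backtraces if so.
use std::{thread, time::Duration};

fn spawn_deadlock_watchdog() {
    thread::spawn(|| loop {
        thread::sleep(Duration::from_secs(10));
        let deadlocks = parking_lot::deadlock::check_deadlock();
        if deadlocks.is_empty() {
            continue;
        }
        eprintln!("{} deadlock(s) detected", deadlocks.len());
        for (i, threads) in deadlocks.iter().enumerate() {
            eprintln!("deadlock #{i}");
            for t in threads {
                eprintln!("thread id {:?}\n{:#?}", t.thread_id(), t.backtrace());
            }
        }
    });
}

fn main() {
    // Call once at process startup, then run the workload under test.
    spawn_deadlock_watchdog();
}
```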