
Barrier: Some actor does not collect barrier so that epoch can not commit

Open Little-Wallace opened this issue 3 years ago • 9 comments

Describe the bug

I ran ./scripts/launch_risedev_bench.sh from https://github.com/singularity-data/tpch-bench and found that all compute nodes stop processing data within a few minutes. The barrier-send-latency metric does not show anything.

I added more logging to the compute node and the meta node and found that some actors never run LocalBarrierManager::collect, so the RPC in StreamService::barrier_complete never returns a result to the meta service, and the meta service cannot commit the epoch. It seems that some actor is blocked and cannot make any progress.
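To make the failure mode concrete, here is a minimal sketch of the collection rule described above: an epoch can only be reported as complete once every actor has collected its barrier. `BarrierTracker`, `send_barrier`, and `stragglers` are hypothetical names for illustration, not RisingWave's actual `LocalBarrierManager` API.

```rust
use std::collections::{HashMap, HashSet};

type ActorId = u32;
type Epoch = u64;

/// Simplified, hypothetical model of per-node barrier collection.
/// This is NOT RisingWave's `LocalBarrierManager`.
#[derive(Default)]
struct BarrierTracker {
    /// For each in-flight epoch, the actors that have not collected the barrier yet.
    pending: HashMap<Epoch, HashSet<ActorId>>,
}

impl BarrierTracker {
    /// Register a barrier for `epoch` that must be collected by all of `actors`.
    fn send_barrier(&mut self, epoch: Epoch, actors: &[ActorId]) {
        self.pending.insert(epoch, actors.iter().copied().collect());
    }

    /// Record that `actor` collected the barrier of `epoch`.
    /// Returns `true` only when this was the last outstanding actor,
    /// i.e. when the epoch could be reported as complete.
    fn collect(&mut self, epoch: Epoch, actor: ActorId) -> bool {
        let Some(remaining) = self.pending.get_mut(&epoch) else {
            return false; // unknown or already completed epoch
        };
        remaining.remove(&actor);
        if remaining.is_empty() {
            self.pending.remove(&epoch);
            true
        } else {
            false
        }
    }

    /// Actors that still block `epoch`, sorted for stable output.
    fn stragglers(&self, epoch: Epoch) -> Vec<ActorId> {
        let mut ids: Vec<ActorId> = self
            .pending
            .get(&epoch)
            .map(|set| set.iter().copied().collect())
            .unwrap_or_default();
        ids.sort_unstable();
        ids
    }
}
```

In this model, a single actor that never calls `collect` keeps the epoch in `pending` forever, which matches the symptom: the completion is never reported and the epoch is never committed. Logging something like `stragglers(epoch)` is one way to find out which actor is stuck.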

To Reproduce

Deploy a new RisingWave cluster on AWS EC2 as described in https://singularity-data.quip.com/EbTfA4xUdJ2z/Benchmark-RisingWave-Three-Easy-Steps (at least three compute nodes are needed) and run ./scripts/launch_risedev_bench.sh. You can then observe that the cluster is blocked.


Little-Wallace, Jul 28 '22 11:07

Which revision does this bug occur on?

BugenZhao, Jul 28 '22 12:07

Which revision does this bug occur on?

I found it on the latest main branch yesterday.

xxhZs, Jul 28 '22 12:07

Which revision does this bug occur on?

I found it on the latest main branch yesterday.

Do you have an exact commit hash?

BugenZhao, Jul 31 '22 06:07

Which revision does this bug occur on?

I found it on the latest main branch yesterday.

Do you have an exact commit hash?

8ad51838db0f4bc013ad0f05a92421ba8f0200df

xxhZs, Jul 31 '22 06:07

I'm also encountering an "epoch can not commit" deadlock when running nexmark q5/q7 with very large parallelism (not related to #4354). It may be related to this issue.

BugenZhao, Aug 02 '22 09:08

@BugenZhao By binary-searching the commits, I found that this bug may be caused by https://github.com/singularity-data/risingwave/pull/4045.

Little-Wallace, Aug 02 '22 13:08

@BugenZhao By binary-searching the commits, I found that this bug may be caused by https://github.com/singularity-data/risingwave/pull/4045.

Oops. Any further investigations?

BugenZhao, Aug 02 '22 16:08

@BugenZhao By binary-searching the commits, I found that this bug may be caused by https://github.com/singularity-data/risingwave/pull/4045.

I've reviewed the implementation of remote input and found that creating the client with moka might be suspicious, considering the bug we found in our LruCache last time. I haven't checked how moka dedups in-flight requests yet.

https://github.com/singularity-data/risingwave/blob/f5ab9b4e561ba2a2e8aaff062db9dc0fde44ecb4/src/stream/src/executor/exchange/input.rs#L125

A quick PoC might be to create the client outside of the stream and check whether the problem still occurs.
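For reference, this is roughly what "deduping in-flight requests" means, as a std-only sketch. It is not moka's implementation (which I haven't checked) and not RisingWave's client pool; `Client`, `ClientPool`, and the `connects` counter are hypothetical, and it uses blocking threads rather than async tasks.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex, OnceLock};

/// Hypothetical stand-in for a connection to a remote compute node.
struct Client {
    addr: String,
}

/// Hypothetical pool that creates at most one client per address,
/// even when many callers ask for the same address concurrently.
#[derive(Default)]
struct ClientPool {
    slots: Mutex<HashMap<String, Arc<OnceLock<Arc<Client>>>>>,
    /// Counts how many times a client was actually created.
    connects: AtomicUsize,
}

impl ClientPool {
    fn get(&self, addr: &str) -> Arc<Client> {
        // Hold the map lock only long enough to find or create the slot.
        let slot = {
            let mut slots = self.slots.lock().unwrap();
            slots.entry(addr.to_owned()).or_default().clone()
        };
        // Exactly one caller runs the closure; concurrent callers for the
        // same address block here until that caller has finished.
        Arc::clone(slot.get_or_init(|| {
            self.connects.fetch_add(1, Ordering::SeqCst);
            Arc::new(Client {
                addr: addr.to_owned(),
            })
        }))
    }
}
```

With this kind of dedup, every caller waiting on the same address depends on the single caller that performs the initialization. If that one never finishes, all of them wait, which is the sort of behavior worth ruling out here.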

BugenZhao, Aug 02 '22 16:08

Was it fixed by https://github.com/singularity-data/risingwave/pull/4505?

fuyufjh, Aug 09 '22 15:08