risingwave icon indicating copy to clipboard operation
risingwave copied to clipboard

do not unbounded wait for inexistent table ids in filter key extractor

Open BugenZhao opened this issue 3 years ago • 1 comments

so it will stuck for lack of Source catalog.

I think the design of unbounded waiting for notification might be problematic.

We've also encountered this issue multiple times when developing just because of some forgetting the catalog. It is a bug, but it looks so weird to developers.

Indeed, anther bug found in #5192 is also caused by this. Committing the epoch after barrier collected will hang, this will make all barriers unfinished or buffered in queue. So both ddl and ckpt barrier will hang.

The real problem is that Compute and Compactor need all state table catalogs and epoch commit rely on this. But for now only state table of SourceExecutor don't have any catalogs and won't be notified. After source state table get supported and our deterministic test is stable, we can avoid this issue. BTW, can we expose the problem faster when committing epoch in anther way like panic for example, rather than hang forever? Cc @Li0k

Originally posted by @yezizp2012 in https://github.com/risingwavelabs/risingwave/issues/5359#issuecomment-1248947585

https://github.com/risingwavelabs/risingwave/blob/fd534ea5c4f90e04b6d0a2a08aec38432f148b1f/src/storage/hummock_sdk/src/filter_key_extractor.rs#L284-L304

BugenZhao avatar Sep 20 '22 05:09 BugenZhao

Agree, this is a fallible point and not easy to observe, some simple ideas:

  1. notify does not provide a timeout, we can spwan a dedicated task for observing the notification timeout and panic it ?
  2. can we use stack_trace to add span for notify?

Li0k avatar Sep 20 '22 06:09 Li0k

can we use stack_trace to add span for notify

Yes, but this should be treated as the fallback solution. It does become easy to debug deadlocks or stuck in our system after we introduced the async stack trace, but I think we should not rely on it and should redesign some parts if needed.

BugenZhao avatar Sep 27 '22 05:09 BugenZhao

@Li0k Is this done? Can we close this issue?

hzxa21 avatar Nov 10 '22 07:11 hzxa21

@Li0k Is this done? Can we close this issue?

fuyufjh avatar Nov 21 '22 08:11 fuyufjh

@Li0k Is this done? Can we close this issue?

yes, use timeout for notify to replace the unbound wait. This may not be the best solution, but we can close this issue first

Li0k avatar Nov 21 '22 09:11 Li0k