risingwave
risingwave copied to clipboard
do not unbounded wait for inexistent table ids in filter key extractor
so it will stuck for lack of
Sourcecatalog.I think the design of unbounded waiting for notification might be problematic.
We've also encountered this issue multiple times when developing just because of some forgetting the catalog. It is a bug, but it looks so weird to developers.
Indeed, anther bug found in #5192 is also caused by this. Committing the epoch after barrier collected will hang, this will make all barriers unfinished or buffered in queue. So both ddl and ckpt barrier will hang.
The real problem is that Compute and Compactor need all state table catalogs and epoch commit rely on this. But for now only state table of SourceExecutor don't have any catalogs and won't be notified. After source state table get supported and our deterministic test is stable, we can avoid this issue. BTW, can we expose the problem faster when committing epoch in anther way like panic for example, rather than hang forever? Cc @Li0k
Originally posted by @yezizp2012 in https://github.com/risingwavelabs/risingwave/issues/5359#issuecomment-1248947585
https://github.com/risingwavelabs/risingwave/blob/fd534ea5c4f90e04b6d0a2a08aec38432f148b1f/src/storage/hummock_sdk/src/filter_key_extractor.rs#L284-L304
Agree, this is a fallible point and not easy to observe, some simple ideas:
- notify does not provide a timeout, we can spwan a dedicated task for observing the notification timeout and panic it ?
- can we use stack_trace to add span for notify?
can we use stack_trace to add span for notify
Yes, but this should be treated as the fallback solution. It does become easy to debug deadlocks or stuck in our system after we introduced the async stack trace, but I think we should not rely on it and should redesign some parts if needed.
@Li0k Is this done? Can we close this issue?
@Li0k Is this done? Can we close this issue?
@Li0k Is this done? Can we close this issue?
yes, use timeout for notify to replace the unbound wait. This may not be the best solution, but we can close this issue first