
RFC: Suspend MV on Non-Recoverable Errors

Open hzxa21 opened this issue 2 years ago • 5 comments

Rendered Link

hzxa21 avatar Feb 24 '23 12:02 hzxa21

I think we may extend the scenarios to cover unrecoverable Sources (https://github.com/risingwavelabs/risingwave/issues/7192). The suspension workflow should be similar from the perspective of Meta. cc @BugenZhao

StrikeW avatar Feb 27 '23 12:02 StrikeW

After the discussion, I'm a little wavering now.

  1. It introduces some extra complexity, not only in programming but also in analyzing a bug afterwards.
  2. Sometimes it's hard to tell exactly whether a situation is non-recoverable or recoverable.

fuyufjh avatar Mar 06 '23 08:03 fuyufjh

  1. It introduces some extra complexity, not only in programming but also in analyzing a bug afterwards.
  2. Sometimes it's hard to tell exactly whether a situation is non-recoverable or recoverable.

This sounds just like the original reasons why we didn't do it.

xxchan avatar Mar 06 '23 12:03 xxchan

We haven't reached consensus during the meeting, but many thoughts popped up.

Goal

  1. Make users aware of errors when they happen
  2. Maintain correct results of the MV/Sink when non-recoverable errors happen
  3. Reduce the impact of non-recoverable errors on streaming and serving

Meeting summary and some thoughts afterwards

  • Goal #1 can be achieved by reporting errors in Grafana/the cloud portal. We can also display the errors in a system table or surface them on batch queries if we want to draw the users' attention.
  • Goal #2 (strict correctness, considering both MV and Sink) can be achieved by suspending the MV as well as all its downstream MVs, but this hurts the availability of a subset of the streaming jobs.
  • Goal #3 can be achieved by either of the following (a minimal sketch of this decision follows this list):
    • Not suspending any MVs when errors happen, but just skipping the errors or filling in default values. Availability is high, but this hurts goal #2.
    • Suspending only the affected MVs. Availability is degraded, but goal #2 can still be achieved.
  • Other thoughts brought up in the meeting:
    • Errors/panics caused by kernel bugs may be recoverable by upgrading the code with the fixes and resuming MVs from the latest checkpoint.
    • We can provide an option for the user to choose whether they want to suspend MVs on non-recoverable errors, but what the default behavior should be is still debatable.
    • We also talked about whether it is possible to keep a temporal join MV running when only its inner-side MV is suspended. IMO, it is doable but complicated: we need atomic rename for MVs so that the inner-side MV can be recovered.
    • Not all panics are non-recoverable. Sometimes developers may panic on recoverable errors without thinking it all through. In such cases, recovery may help.
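
To make the tradeoff between goals #2 and #3 concrete, here is a minimal sketch of how an executor could classify errors and choose between skipping a record and suspending the job. The types and function names are hypothetical illustrations, not RisingWave's actual API.

```rust
// Hypothetical sketch: classify an error and decide whether to skip the
// record (goal #3, availability) or suspend the job (goal #2, correctness).

/// Hypothetical classification of an error raised while processing a record.
enum StreamError {
    /// e.g. an expression evaluation error on one row; the row can be
    /// dropped or replaced with a default value.
    Recoverable(String),
    /// e.g. inconsistent internal state; continuing would produce wrong results.
    NonRecoverable(String),
}

/// Hypothetical per-job action decided by the error handler.
enum Action {
    SkipRecord,
    SuspendJob,
}

fn on_error(err: &StreamError) -> Action {
    match err {
        StreamError::Recoverable(msg) => {
            // Report the error (goal #1), then keep the job running.
            eprintln!("recoverable error, skipping record: {msg}");
            Action::SkipRecord
        }
        StreamError::NonRecoverable(msg) => {
            // Suspend only the affected MV and its downstream MVs.
            eprintln!("non-recoverable error, suspending job: {msg}");
            Action::SuspendJob
        }
    }
}

fn main() {
    let err = StreamError::NonRecoverable("inconsistent state".to_string());
    match on_error(&err) {
        Action::SkipRecord => println!("continue processing"),
        Action::SuspendJob => println!("suspend this MV and its downstream MVs"),
    }
}
```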

In fact, goal #2 contradicts goal #3, and users may have different priorities depending on their use cases or development stages:

  • In a test environment or use cases that cannot tolerate incorrect results (e.g. sinking a wrong signal to an external system may not be acceptable), users may want to catch non-recoverable errors and fix them immediately before bad things happen. In this case, correctness is preferred over availability.
  • In production or use cases that can tolerate incorrect results, users may want their service to always be up. In this case, MV suspension may be a concern since availability is preferred over strict correctness.

hzxa21 avatar Mar 06 '23 14:03 hzxa21

In production or use cases that can tolerate incorrect results, users may want their service to always be up. In this case, MV suspension may be a concern since availability is preferred over strict correctness.

This is the key concern of this RFC. However, in the current implementation, even if incorrect results are acceptable to the user, not all errors can be skipped. Let's assume all the errors that can be skipped are recoverable errors. I suggest we constrain our discussion and focus on the following two issues:

How to identify the errors that cannot be skipped?

If we cannot identify these errors, we cannot do anything about them. The main debate is about how to handle panics.

The only feasible option seems to be letting developers carefully identify non-recoverable panics and report them as non-recoverable errors because, as mentioned in the comment, we can hardly know the origin actor of a panic.

Example: we already did the conversion from panic to error for MemTableError::InconsistentOperation, and this error is non-recoverable: https://github.com/risingwavelabs/risingwave/blob/f5e41a190b284b26a83b7740af77c842d5e35837/src/storage/src/mem_table.rs#L99
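
As an illustration of this pattern, a panic site can be rewritten to return a typed error that the caller treats as non-recoverable. This is only a sketch: the variant name echoes the linked code, but the fields, function, and call site here are hypothetical, not the actual mem_table.rs implementation.

```rust
// Hypothetical sketch of converting a panic into a typed non-recoverable
// error, so the stream manager can suspend the job instead of crashing.
use thiserror::Error;

#[derive(Error, Debug)]
enum MemTableError {
    /// The in-memory state no longer matches the expected operation;
    /// continuing would corrupt results, so this must not be skipped.
    #[error("inconsistent operation on key {key}")]
    InconsistentOperation { key: String },
}

fn apply_delete(existing: Option<&str>, key: &str) -> Result<(), MemTableError> {
    match existing {
        // Value present, so the delete is consistent.
        Some(_) => Ok(()),
        // Instead of `panic!("delete on non-existent key")`, surface a typed
        // error and let the caller decide to suspend the affected job.
        None => Err(MemTableError::InconsistentOperation { key: key.to_string() }),
    }
}

fn main() {
    if let Err(e) = apply_delete(None, "user_42") {
        eprintln!("non-recoverable: {e}");
    }
}
```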

What should we do when encountering errors that cannot be skipped?

Possible options:

  1. Trigger recovery over and over again, hoping that recovery can fix the errors.
  2. Pause all MVs and tell users to intervene. From the user's point of view, this is equivalent to option 1 if recovery cannot fix the errors (endless recovery == pausing all MVs).
  3. Pause some MVs and tell users to intervene. This is the proposal from this RFC.
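
For illustration, the three options could be modeled as a configurable policy. The enum and its name below are hypothetical, not an existing RisingWave setting.

```rust
/// Hypothetical policy for handling errors that cannot be skipped.
#[derive(Clone, Copy, Debug)]
enum NonRecoverableErrorPolicy {
    /// Option 1: keep triggering recovery, hoping it resolves the error.
    RetryRecovery,
    /// Option 2: pause every streaming job and wait for user intervention.
    PauseAll,
    /// Option 3 (this RFC): pause only the affected MV and its downstream MVs.
    PauseAffected,
}

fn main() {
    // The default is still debatable; this RFC proposes option 3.
    let policy = NonRecoverableErrorPolicy::PauseAffected;
    println!("policy on non-recoverable errors: {policy:?}");
}
```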

hzxa21 avatar Mar 06 '23 15:03 hzxa21