Add FailHandler to node
Context in this issue https://github.com/lni/dragonboat/issues/381. In short: I have my state machines on different disks, and I need the ability to stop one if a disk fails. Currently, there’s no way to do that — any error returned from the SM leads to a panic inside Dragonboat. @kevburnsjr proposed adding an error handler to the node: to stop the replica on error and then call the handler out of band. Here are the changes:
- Added the `FailHandler` function in the node config. The proposed name in the issue was `Recover`, but it's a bit confusing, in my opinion.
- Errors returned from `IOnDiskStateMachine` are wrapped.
- Changes to the node:
  - The node checks if the error originated in the user's state machine and if a `FailHandler` is present. If so, the node does `requestRemoval` and saves the error. The node returns no error to the engine in this case. Engine workers shouldn't start new tasks on the node since it is `stopped()`.
  - `FailHandler` is called with the saved error once the node is fully offloaded.
- Added test cases for errors from all SM methods.
I haven’t tested this change in production yet; I’ll report once I do.
Please let me know what you think.
This is great. Appears to require fewer changes than I expected.
One comment about naming:
Fail is a verb, so the name FailHandler is a little confusing semantically.
What do you think about renaming the config option to something more general, like FaultHandler or ErrorHandler? The words Fault and Error represent a superset of both recoverable and unrecoverable events, whereas Failure typically refers only to unrecoverable events.
Article: Fault and Failure in Distributed Systems
If a fault is recoverable, the handler might repair some corrupted rows and restart the replica, making the system fault-tolerant. If a fault is permanent, the handler might mark the replica as failed, delete all associated data, and reschedule it elsewhere.
I think it might be valuable to retain the distinction between recoverable and unrecoverable errors by reserving the word Fail for external use where the calling code is responsible for deciding whether a fault should be escalated to a failure.
Is there any specific reason you chose to implement this only for OnDiskStateMachine and not for InMemStateMachine or ConcurrentStateMachine?
As an example of where this might be useful: I'm presently creating a system that encapsulates state machine logic in WebAssembly modules. Some of these are in-memory state machines. If the WebAssembly module panics during execution, I might want to stop the replica rather than ignoring the proposal or crashing the host.
Are you worried that users might get confused when they pass a FailHandler to the host in a node config when starting an in-memory state machine, but the fail handler never gets called, and the start-state-machine function also doesn't return an error stating that FailHandler is not supported for non-disk state machines?
@i512 thanks for the PR, but I am a little bit confused. May I ask what the benefit of having such a handler is, and why it can't be done in your application layer?
Sure, a possible scenario: Dragonboat issues an Update on the state machine, and the disk that this state machine saves data to has just entered RO mode. So we can't continue working on this shard for the time being. How can we stop this SM without causing a panic so that we don't disturb other shards on this NodeHost?
We can issue StopShard on the NodeHost; however, Dragonboat still expects us to return from the Update.
- If we return an error from Update, then Dragonboat will still panic.
- If we mask the error—return nil—then Dragonboat might record the new applied index. If the disk ever comes back and we try to start the replica, the applied index of the state machine will be lower than what Dragonboat expects, and it will rightly panic.
Not sure what else we can do in the application layer besides just hanging in Update forever. But that hangs an apply worker, and the NodeHost then can't be stopped (it waits for apply workers to exit).
So, I don't think it is currently possible to handle it in the application layer.
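The second bullet above is the crux: masking the error desynchronizes the applied index recorded by the host from what the state machine actually persisted. A toy simulation of that bookkeeping (the type and field names here are invented for illustration, not dragonboat's):

```go
package main

import "fmt"

// host models the applied-index bookkeeping; sm models a disk-backed
// state machine that persists entries durably.
type host struct{ appliedIndex uint64 }
type sm struct{ persistedIndex uint64 }

// apply models one Update call; persistFails simulates a disk that has
// just entered read-only mode. Masking the error means the host still
// advances its applied index even though nothing was persisted.
func (h *host) apply(s *sm, index uint64, persistFails bool) {
	if !persistFails {
		s.persistedIndex = index // durable write succeeded
	}
	h.appliedIndex = index // error masked: index advances regardless
}

func main() {
	h, s := &host{}, &sm{}
	h.apply(s, 1, false)
	h.apply(s, 2, true) // disk failed, error masked
	// On restart the SM reports a lower index than the host expects,
	// which dragonboat rightly treats as a fatal inconsistency.
	fmt.Println(h.appliedIndex, s.persistedIndex) // prints "2 1"
}
```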
@kevburnsjr thanks for your feedback!
Fail is a verb, so the name FailHandler is a little confusing semantically.
What do you think about renaming the config option to something more general, like FaultHandler or ErrorHandler? The words Fault and Error represent a superset of both recoverable and unrecoverable events, whereas Failure typically refers only to unrecoverable events.
I wasn't aware of this semantic difference. Sure, I like FaultHandler. Will update the PR now.
Is there any specific reason you chose to implement this only for `OnDiskStateMachine` and not for `InMemStateMachine` or `ConcurrentStateMachine`?
Yes, this is intentional. I did not think that there was any valid reason for an update on an InMemStateMachine to fail besides a logic error, in which case it is probably better to panic.
I'm presently creating a system that encapsulates state machine logic in webassembly modules. Some of these are in memory state machines. If the WebAssembly module panics during execution, I might want to stop the replica rather than ignoring the proposal or crashing the host.
It is a reasonable example, if you are loading these state machines dynamically. I can update the PR to work for all SM types. It would be interesting to know more about your use case.
hi @i512
Thanks for the update above. I like your suggested scenario. Please allow a couple of days so I can double-check whether anything else is still missing; I will get back to this PR during the coming weekend. Hopefully we can get it merged very soon.
@i512
First of all, you have a valid scenario as you explained above, and I agree that dragonboat should be updated to handle such a case. Below are my questions:
We can define a special error type (say `ErrStateMachineInFaultState`) to indicate that the user state machine has encountered an error, so the affected replica should be stopped without impacting other shards running on the same node. User state machine code can just return `ErrStateMachineInFaultState` to avoid the panic and to make sure that particular update won't be marked as applied. The returned `ErrStateMachineInFaultState` error would be included in a new field added to `statemachine.Result`. On receiving it, user applications are free to do whatever is necessary to handle the fault.
Will this approach also work for the scenario you are trying to handle? Are there any other situations that can't be handled by it but could be better handled by your proposed faultHandler approach?
thanks.
It is possible that a proposal to a shard may fail for 1 replica (on a node with a full disk) and succeed for the other 2 replicas. In that case the proposal would be accepted and applied and the client need not be aware of how many hosts acknowledged the write or whether any encountered a fault.
Stopping the replica and firing a callback on the host where the fault occurred seems like the simplest solution to me since the host from which the proposal was issued is not necessarily the same host as where the fault occurred.
Returning the error in sm.Result forces the user application to build an out-of-band protocol for resolving replica state machine faults whereas if the callback is triggered on the host with the failed replica, it may be resolved without coordination.
Also, if a replica encounters a fault and returns the error with the state machine result across a network to the leader, that message may never be received if a network disruption or request timeout occurs. A local callback won't have to account for network partitions in triggering immediate recovery. If the replica is restarted after a fault without any action taken, the replica is likely to reproduce the fault again when it reapplies the log, ensuring that the user application receives the fault at-least-once rather than at-most-once.
Given that `statemachine.Result` represents the result of applying a proposal to a shard, it doesn't seem appropriate to me to return replica-level errors like `ErrStateMachineInFaultState` as part of the result. Given a 5-node cluster where 2 nodes encounter a fault, which fault is returned with the successful result? Would it be a slice of errors?
The important thing to me is that returning a particular error type from a state machine update method stops the replica and prevents the host from panicking. With this functionality alone, user applications could wire their own recovery path by passing messages out of the state machine through a local error channel to the user application. The FaultHandler config option just serves as a standard fault recovery mechanism so that user applications don't have to build their own error channels into their state machines.
Use of a standardized fault recovery mechanism like this might make it easier for users to identify which state machines and user applications support fault recovery and which do not.
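The "build your own error channel into the state machine" alternative mentioned above could look roughly like this. All types here are invented for illustration, not dragonboat's interfaces:

```go
package main

import (
	"errors"
	"fmt"
)

// faultySM is a toy state machine that reports faults to the user
// application through a local channel instead of a host callback.
type faultySM struct {
	faults chan error
}

// update reports the fault out of band and also returns it so the
// replica can be stopped; the channel send is non-blocking so Update
// never hangs if the application is not draining the channel.
func (s *faultySM) update(diskFailed bool) error {
	if diskFailed {
		err := errors.New("disk entered read-only mode")
		select {
		case s.faults <- err:
		default: // drop if nobody is listening
		}
		return err
	}
	return nil
}

func main() {
	machine := &faultySM{faults: make(chan error, 1)}
	_ = machine.update(true)
	// The user application drains the channel and runs its own recovery.
	fmt.Println(<-machine.faults) // prints "disk entered read-only mode"
}
```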
Hi @lni, thank you for your reply!
Such a returned `ErrStateMachineInFaultState` error would be included in a new field added to `statemachine.Result`. On receiving it, user applications are free to do whatever is necessary to handle the fault.
Will this approach also work for the scenario you are trying to handle? Are there any other situations that can't be handled by it but could be better handled by your proposed faultHandler approach?
@kevburnsjr makes some great points about this.
I would also add that this approach won't work if the disk fails while the state machine is in one of the other methods, which would cause it to return an error and lead to a panic inside Dragonboat. These methods (Sync, PrepareSnapshot, SaveSnapshot, RecoverSnapshot) are called by Dragonboat internally and do not necessarily have a corresponding request, so there is no request through which an error could be returned.
I think the fault handler function is the simplest and most uniform way to handle state machine errors.
we can define a special error type (say ErrStateMachineInFaultState) to indicate that the user state machine had some errors
I do not have a strong opinion about this, but I don't think that a special error type is necessary, as in the current state machine API, any returned error can be considered ErrStateMachineInFaultState. In my case, for example, I would have to wrap all errors in all state machine methods.
any returned error can be considered `ErrStateMachineInFaultState`
I agree. Any replica started with a fault handler defined could expect all errors returned by a state machine to be directed to the fault handler. This would be backward compatible. Existing implementations would continue to panic since they don't specify a fault handler and no special error type would be necessary.
Another thought...
Might it be possible for a user to build their own fault handler using only ISystemEventListener.NodeUnloaded()?
The user would just need to keep track of the error returned by the state machine and make it available to the system event listener.
In that case, the only change to the config struct for replicas might be a boolean:
```go
type Config struct {
	// ...

	// NoPanic causes replicas to be stopped rather than panicking
	// when a state machine returns an error. These errors may then be
	// handled in [raftio.ISystemEventListener.NodeUnloaded].
	NoPanic bool
}
```
In this approach, no callback is invoked by the engine; it just stops the replica, and that's all it does.
This seems like the least intrusive interface change.
Hey, @kevburnsjr
The user would just need to keep track of the error returned by the state machine and make it available to the system event listener.
This would require users to add this error handler code to almost all state machine methods, which is not hard but is noisy.
This seems like the least intrusive interface change.
Well, isn't it almost the same: a boolean config value vs. a func? An empty FaultHandler func would work the same as NoPanic. I like your original idea a bit more.
The only difference between a boolean and an interface is that a boolean is serializable while an interface is not, so I consider it less intrusive. This alternative was just a thought, so I figured I would present it to ensure all possibilities are considered.
I don't have any strong opinions about how the panic avoidance ought to be implemented; I just really want to see it merged to master in some usable form.