flux-sched

fault tolerance: need error propagation analysis

Open dongahn opened this issue 5 years ago • 5 comments

This will likely be broken into multiple issues, but I wanted to open this to keep track of these important items as we work on "stabilization" tasks towards a tape out. Within resource and qmanager, there are some RPCs that leave the internal state inconsistent when a failure occurs. We need to analyze this more closely and define clearer error handling semantics.

dongahn • Mar 07 '20 07:03

There are several call sites where return codes are not checked:

https://github.com/flux-framework/flux-sched/blob/master/resource/modules/resource_match.cpp#L461

dongahn • Jun 11 '20 04:06

There are a few error paths where errno is not preserved. We need to save and restore errno around library calls (e.g., json_decref) made on the error paths.

https://github.com/flux-framework/flux-sched/blob/master/resource/writers/match_writers.cpp#L261
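
The save/restore pattern described above could look roughly like the sketch below. It is only an illustration: `release_buffer` and `emit` are hypothetical stand-ins (not functions from flux-sched or jansson), and `release_buffer` overwrites errno on purpose to simulate a cleanup call that clobbers it.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>

// Hypothetical cleanup routine standing in for a library call such as
// json_decref. It deliberately overwrites errno to simulate the problem.
void release_buffer (void *buf)
{
    assert (buf != nullptr);  // precondition: caller owns a live buffer
    std::free (buf);
    errno = 0;  // simulated clobbering of errno by the cleanup call
}

// Hypothetical emit routine: on failure it cleans up but still reports
// the original error code to its caller through errno.
int emit (bool inject_failure)
{
    void *buf = std::malloc (16);
    if (!buf) {
        errno = ENOMEM;
        return -1;
    }
    if (inject_failure) {
        errno = EINVAL;            // the error the caller should see
        int saved_errno = errno;   // save before any cleanup call
        release_buffer (buf);      // may change errno
        errno = saved_errno;       // restore the original error
        return -1;
    }
    release_buffer (buf);
    return 0;
}
```

With this shape, a caller that checks errno after `emit` returns -1 sees the original EINVAL rather than whatever the cleanup call left behind.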

dongahn • Jun 11 '20 04:06

From #679:

There are some error paths within the sched-fluxion-resource module that let errors pass silently and respond to the request RPC with a successful response. In particular, this happens when the writer fails to emit properly. We should make these errors loud and respond to the RPC with an error.

EDIT: we should also decide how we want to recover from the above failure, since technically the allocation for the job has already been made in fluxion-resource. Do we want to automatically roll back the allocation, or let the requesting client handle the cancellation/rollback?
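
To make the first of those two options concrete, here is a minimal sketch of an automatic rollback followed by an error response. Everything in it is a hypothetical stand-in: `alloc_table`, `response`, `emit_R`, and `handle_match` are illustrative names, not the module's actual types or handlers, and the in-memory set only models "an allocation exists for this job".

```cpp
#include <cassert>
#include <cerrno>
#include <set>
#include <string>

// Hypothetical stand-in for the allocation state held by the resource module.
struct alloc_table {
    std::set<long> jobs;
    bool allocate (long jobid) { return jobs.insert (jobid).second; }
    void cancel (long jobid)
    {
        assert (jobs.count (jobid) == 1);  // only roll back what exists
        jobs.erase (jobid);
    }
};

// Hypothetical RPC response: errnum == 0 means success.
struct response {
    int errnum;
    std::string payload;
};

// Hypothetical writer: returns -1 with errno set when the emit fails.
int emit_R (long jobid, bool writer_ok, std::string &out)
{
    if (!writer_ok) {
        errno = EINVAL;
        return -1;
    }
    out = "R for job " + std::to_string (jobid);
    return 0;
}

// Sketch of a handler that never reports success after a failed emit.
response handle_match (alloc_table &table, long jobid, bool writer_ok)
{
    if (!table.allocate (jobid))
        return {EEXIST, ""};
    std::string out;
    if (emit_R (jobid, writer_ok, out) < 0) {
        int saved_errno = errno;  // keep the writer's error code
        table.cancel (jobid);     // automatic rollback of the allocation
        return {saved_errno, ""}; // loud failure instead of silent success
    }
    return {0, out};
}
```

The alternative, leaving the allocation in place and letting the client cancel, would drop the `table.cancel` call but still return the error response.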

SteVwonder • Jul 08 '20 19:07

Here is an additional problem:

Our DFU traverser concatenates one or more error strings onto its err_message string member so that the upper layer can use get_err_message() to print it. In several places, the string added to err_message ends with a newline character, which doesn't work well with flux_log_error. An example:

ahn1@5b12c7ea7263:/usr/src$ python3 t/scripts/flux-ion-resource.py find status=adown
2020-07-15T17:03:56.268829Z sched-fluxion-resource.err[0]: run_find: find: invalid criteria: status=adown.
2020-07-15T17:03:56.268864Z sched-fluxion-resource.err[0]: : Invalid argument

Look at the extra colon in front of "Invalid argument". We will need a way for the upper layer to iterate over each error string so that it can print them properly.

This may also give a better way to resolve one of the pending issues: #409.
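
One possible shape for that iteration is sketched below. It assumes the messages are kept as separate entries with trailing newlines stripped on insertion, so that a logger which appends its own suffix (as the output above suggests flux_log_error does with the errno text) gets one clean line per message. `error_log` and its methods are hypothetical names, not the traverser's existing interface.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical container that replaces a single concatenated string.
class error_log {
public:
    // Store one message, dropping trailing newline characters.
    void add (const std::string &msg)
    {
        std::string s = msg;
        while (!s.empty () && (s.back () == '\n' || s.back () == '\r'))
            s.pop_back ();
        if (!s.empty ())
            m_messages.push_back (s);
    }
    // Let the upper layer iterate over individual messages.
    const std::vector<std::string> &messages () const
    {
        return m_messages;
    }
    // Most recent message; callers must check messages () first.
    const std::string &last () const
    {
        assert (!m_messages.empty ());
        return m_messages.back ();
    }
    void clear () { m_messages.clear (); }

private:
    std::vector<std::string> m_messages;
};
```

The upper layer could then loop over `messages ()` and log each entry on its own, rather than passing a multi-line string to a single log call.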

dongahn • Jul 15 '20 17:07

I want to spend a bit more time on this. Targeting the Sep release.

dongahn • Aug 31 '20 16:08