fault tolerance: need error propagation analysis
This will likely be broken into multiple issues, but I wanted to open it to keep track of these important items as we work on "stabilization" tasks towards a tape out. Within resource and qmanager, some RPCs leave the internal state inconsistent when a failure occurs. We need to analyze this more closely and define clearer error handling semantics.
There are several call sites where return codes are not checked:
https://github.com/flux-framework/flux-sched/blob/master/resource/modules/resource_match.cpp#L461
There are a few error paths where errno is not preserved. We need to save and restore errno around library calls (e.g., json_decref) made on the error paths.
https://github.com/flux-framework/flux-sched/blob/master/resource/writers/match_writers.cpp#L261
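One way to make the save/restore hard to forget is a small RAII guard. This is only a sketch: `errno_guard_t`, `noisy_cleanup`, and `emit_or_fail` are hypothetical names made up for illustration, and `noisy_cleanup` merely simulates a cleanup call (such as json_decref) that happens to overwrite errno.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>

// RAII guard: captures errno at construction and restores it at scope exit,
// so cleanup calls on an error path cannot clobber the original error code.
class errno_guard_t {
public:
    errno_guard_t () : m_saved (errno) {}
    ~errno_guard_t () { errno = m_saved; }
    errno_guard_t (const errno_guard_t &) = delete;
    errno_guard_t &operator= (const errno_guard_t &) = delete;
private:
    int m_saved;
};

// Hypothetical stand-in for a cleanup routine that modifies errno
// as a side effect.
inline void noisy_cleanup (void *p)
{
    std::free (p);
    errno = 0; // simulate a library call overwriting errno
}

// Hypothetical function that fails with EINVAL and cleans up on the
// error path while keeping EINVAL visible to the caller.
inline int emit_or_fail (bool fail)
{
    void *buf = std::malloc (16);
    if (fail) {
        errno = EINVAL;
        errno_guard_t guard; // save errno before running cleanup
        noisy_cleanup (buf);
        return -1;           // guard's destructor restores EINVAL
    }
    std::free (buf);
    return 0;
}

// Usage check: the caller still sees EINVAL after the failing call.
inline void check_errno_preserved ()
{
    errno = 0;
    int rc = emit_or_fail (true);
    assert (rc == -1 && errno == EINVAL);
    (void)rc;
}
```

A plain `int saved_errno = errno; ...; errno = saved_errno;` pair works just as well; the guard only helps when an error path has several exits.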
From #679:
There are some error paths within the sched-fluxion-resource module that let errors pass silently and respond to the request RPC with a success response. In particular, this happens when the writer fails to emit properly. We should make these errors loud and respond to the RPC with an error.
EDIT: we should also decide how we want to recover from the above failure, since technically the allocation for the job has already been made in fluxion-resource. Do we want to automatically roll back the allocation, or let the requesting client handle the cancellation/rollback?
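To make the first option concrete, here is a rough sketch of what an automatic rollback could look like. Everything in it is hypothetical: `match_ops_t` and `match_with_rollback` are illustration-only names, and the three callbacks stand in for whatever the module actually uses to allocate, emit, and cancel. It is not a description of the current code.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>

// Hypothetical bundle of the three steps involved in serving a match RPC.
// Each callback returns 0 on success, or -1 with errno set on failure.
struct match_ops_t {
    std::function<int ()> allocate; // reserve resources for the job
    std::function<int ()> emit;     // writer emits the matched resources
    std::function<int ()> cancel;   // undo the allocation
};

// Returns 0 on success; -1 with errno set on failure. If emit fails after
// a successful allocation, the allocation is rolled back and the emit
// failure is what the caller sees in errno.
inline int match_with_rollback (const match_ops_t &ops)
{
    assert (ops.allocate && ops.emit && ops.cancel);
    if (ops.allocate () < 0)
        return -1;
    if (ops.emit () < 0) {
        int saved_errno = errno; // keep the emit failure as the reported error
        ops.cancel ();           // best-effort rollback
        errno = saved_errno;
        return -1;
    }
    return 0;
}
```

With this shape, the RPC handler would respond with an error whenever the function returns -1, and the client would never see a job that is allocated but has no emitted resource set. The alternative (client-driven cancellation) would skip the `cancel` call and only report the error.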
Here is an additional problem:
Our DFU traverser concatenates one or more error strings into its err_message string member so that the upper layer can use get_err_message() to print it. In several places, the string appended to err_message ends with a newline character, which doesn't work well with flux_log_error. An example:
ahn1@5b12c7ea7263:/usr/src$ python3 t/scripts/flux-ion-resource.py find status=adown
2020-07-15T17:03:56.268829Z sched-fluxion-resource.err[0]: run_find: find: invalid criteria: status=adown.
2020-07-15T17:03:56.268864Z sched-fluxion-resource.err[0]: : Invalid argument
Note the extra colon in front of "Invalid argument". We will need a way for the upper layer to iterate over each error string to print them properly.
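A possible shape for this, as a sketch only: keep the messages as a list rather than one concatenated string, and strip trailing newlines when a message is added. `err_messages_t` and its methods are hypothetical names, not existing traverser API.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical container that could replace a single concatenated
// err_message string. Messages are stored one per entry with trailing
// newlines removed, so the upper layer can log each one separately.
class err_messages_t {
public:
    void add (std::string msg)
    {
        // Strip trailing newline / carriage-return characters.
        while (!msg.empty () && (msg.back () == '\n' || msg.back () == '\r'))
            msg.pop_back ();
        if (!msg.empty ())
            m_msgs.push_back (std::move (msg));
    }
    const std::vector<std::string> &messages () const { return m_msgs; }
    void clear () { m_msgs.clear (); }
private:
    std::vector<std::string> m_msgs;
};

// Usage check: the trailing newline is gone and empty messages are skipped.
inline void check_err_messages ()
{
    err_messages_t errs;
    errs.add ("find: invalid criteria: status=adown.\n");
    errs.add ("\n");
    assert (errs.messages ().size () == 1);
    assert (errs.messages ()[0] == "find: invalid criteria: status=adown.");
}
```

The upper layer would then loop over `messages ()` and log each entry on its own, so no entry carries an embedded line break into the log call.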
This may also give a better way to resolve one of the pending issues: #409.
I want to spend a bit more time on this. Targeting the Sep release.