system_modes
Layered handling of node and (sub-)system errors
from (#47)
This is in the context of our exemplary case of the `laser_driver` error. We want to elaborate on the layered approach we discussed in the last MROS meeting. This is how I interpret our desired design (please comment if something is not correct or clear):
- First, the `laser_driver` error-handling code tries to recover from the error in the `ErrorProcessing` transition state. (from here it is a related but different issue)
- If it does not succeed (I guess that means the node does not transition to `Active`), the `ModeManager` tries to recover from the error using the `feature/rules`. For this, @jginesclavero is adding a rule in the SystemModes file of our system.
- If there is no rule, or there is one but after applying it the alternative `MODE(s)` of the `laser_driver` are not reached either, the `ModeManager` reports to the `MROS Metacontroller` that the corresponding (sub)system(s) MODE(s) are not reachable. (see issue for the continuation of the handling of errors at the higher layers)
continuation
Currently this will be implemented in a passive way, by offering that information (see https://github.com/micro-ROS/system_modes/issues/43).
But, since the current target `MODE` cannot be reached, we were wondering (in a discussion with TUD and URJC) whether the `ModeManager` should report this actively and system-wide, for the operator or any supervisory system (e.g. the `MROS Metacontroller`) to handle it.
Proposal: Since not being able to reach the target `MODE` is a deviation from the expected and desired behaviour, we propose that the `ModeManager` uses `diagnostics` to report this. The `MROS Metacontroller` will subscribe to such diagnostic messages.
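To make the proposal a bit more concrete, here is a minimal sketch of what such a diagnostic report could contain. It is an assumption, not existing system_modes code: `DiagnosticStatus` and `KeyValue` are plain C++ stand-ins that mirror the fields of `diagnostic_msgs/msg/DiagnosticStatus`, and `make_mode_deviation_status` as well as the key names `target_mode` and `actual_mode` are hypothetical.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Stand-in for diagnostic_msgs/msg/KeyValue so the sketch builds without ROS 2.
struct KeyValue {
  std::string key;
  std::string value;
};

// Stand-in mirroring the fields of diagnostic_msgs/msg/DiagnosticStatus.
struct DiagnosticStatus {
  // Level values as defined in the DiagnosticStatus message.
  enum Level : std::uint8_t { OK = 0, WARN = 1, ERROR = 2 };

  std::uint8_t level = OK;
  std::string name;
  std::string message;
  std::string hardware_id;
  std::vector<KeyValue> values;
};

// Hypothetical helper: builds the status the mode manager could publish for
// `part` when it was asked to reach `target_mode` and is seen in `actual_mode`.
DiagnosticStatus make_mode_deviation_status(
  const std::string & part,
  const std::string & target_mode,
  const std::string & actual_mode)
{
  DiagnosticStatus status;
  status.name = "mode_manager: " + part;
  status.hardware_id = part;
  if (target_mode == actual_mode) {
    status.level = DiagnosticStatus::OK;
    status.message = "target mode reached";
  } else {
    status.level = DiagnosticStatus::ERROR;
    status.message = "target mode not reachable";
  }
  // Attach both modes so a subscriber can reason about the deviation.
  status.values.push_back({"target_mode", target_mode});
  status.values.push_back({"actual_mode", actual_mode});
  return status;
}
```

A subscriber such as the metacontroller would then only need to filter on the `ERROR` level and read the two key/value pairs.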
(@fmrico @jginesclavero @marioney please comment if I missed something or did not convey it correctly)
What do you think @norro ?
What the mode manager will actually already sense is the deviation between the requested state/mode and the actual state/mode. This is not yet merged to master, but available in the `feature/rules` branch, because it is necessary in order to decide when to apply rules. See feature/rules:mode_inference.cpp.
Reporting these deviations to diagnostics is an interesting idea.
This is again a question of timing, though. When a state/mode transition is requested, there is always an immediate deviation, since systems/nodes take some time to perform the transition. So the mode manager will have to decide when to report the deviation, i.e. when to assume that the transition is taking too long and the deviation can therefore be considered an erroneous deviation. Do you have an idea how/when to do this? After half a second? A second? ... @chcorbato
Suggestion:
- When a deviation is detected, wait a certain time `t_0` before considering it an erroneous deviation
- After `t_0`, try to apply a rule, if an appropriate rule exists. If no rule exists, try to recover the node/system
- Wait a certain time `t_1` and if nothing happened, report the erroneous deviation, e.g., through diagnostics

(`t_0` and `t_1` have to be configurable obviously)
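The suggested timing could be sketched as a small state machine along these lines. This is only an illustration under assumptions: `DeviationMonitor` and `DeviationAction` are hypothetical names that do not exist in system_modes, time is passed in explicitly to keep the sketch testable, and the values chosen for `t_0` and `t_1` would come from configuration.

```cpp
#include <chrono>

using Duration = std::chrono::milliseconds;

// What the caller should do after an update.
enum class DeviationAction { None, ApplyRuleOrRecover, ReportError };

// Hypothetical monitor for one node/system; not part of system_modes.
class DeviationMonitor {
public:
  DeviationMonitor(Duration t0, Duration t1) : t0_(t0), t1_(t1) {}

  // Call periodically. `now` is the time since an arbitrary monotonic epoch,
  // `deviating` is true while requested and actual state/mode differ.
  DeviationAction update(Duration now, bool deviating) {
    if (!deviating) {
      phase_ = Phase::Nominal;  // transition finished or recovery succeeded
      return DeviationAction::None;
    }
    switch (phase_) {
      case Phase::Nominal:
        // A fresh deviation is expected right after a transition request.
        phase_ = Phase::Waiting;
        since_ = now;
        return DeviationAction::None;
      case Phase::Waiting:
        if (now - since_ >= t0_) {
          // t_0 elapsed: treat the deviation as erroneous, try rule/recovery.
          phase_ = Phase::Recovering;
          since_ = now;
          return DeviationAction::ApplyRuleOrRecover;
        }
        return DeviationAction::None;
      case Phase::Recovering:
        if (now - since_ >= t1_) {
          // t_1 elapsed without success: report, e.g. through diagnostics.
          phase_ = Phase::Reported;
          return DeviationAction::ReportError;
        }
        return DeviationAction::None;
      case Phase::Reported:
        return DeviationAction::None;  // report only once per deviation
    }
    return DeviationAction::None;
  }

private:
  enum class Phase { Nominal, Waiting, Recovering, Reported };

  Duration t0_;
  Duration t1_;
  Duration since_{0};
  Phase phase_ = Phase::Nominal;
};
```

With `t_0` = 500 ms and `t_1` = 1000 ms (placeholder values), a deviation first seen at 0 ms would trigger `ApplyRuleOrRecover` at 500 ms and `ReportError` at 1500 ms, unless the deviation disappears in between.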
/cc @chcorbato @ralph-lange
I like very much your suggestion of a configurable time limit for each management layer!
Do you have suggestions for these times in the case of navigation2 @fmrico @marioney @jginesclavero @lbajo ?
@chcorbato The `feature/rules` branch is merely a micro-ROS experiment for now, btw. For "2. If it does not succeed [...] tries to recover from the error using rules" I consider metacontrol (reconfiguration actions?) in charge.
We are even happy to drop the system modes rules feature completely once the metacontrol part for this task is integrated with system modes.
I see. Currently @jginesclavero is trying to get results with that feature this week, by adding such a rule to the Pilot-URJC system model.
I propose we keep this test for this week and analyse the result afterwards (usefulness, problems...), so that we can make an informed decision on moving the feature to the metacontrol part.
What do you think @norro @jginesclavero ? @norro are you available to keep supporting @jginesclavero on this today and tomorrow?
Yes, I am available today and tomorrow to help with upcoming issues.
Hi @norro @chcorbato !
I was testing the `feature/rules` branch and it works as we expected. In short, I have defined a rule that changes to `DEGRADED` mode (navigation with pointcloud_to_laser) if the laser_driver is not in the `active` state. The mode is changed immediately; it works really nicely.
I have done some navigation tests where I force a laser failure, and the mode changes correctly: the laser is replaced by the pointcloud and the navigation continues.
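For readers who have not looked at the branch, the logic of such a rule can be sketched as follows. This is not the rule syntax or the implementation of `feature/rules`: the `Rule` struct, `select_target_mode`, and the mode name `NORMAL` in the usage note are hypothetical and only illustrate the "if the laser_driver is not active, switch to `DEGRADED`" behaviour described above.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory form of a rule (not the feature/rules file format).
struct Rule {
  std::string if_target_mode;   // system target mode the rule applies to
  std::string part;             // node whose state is checked
  std::string required_state;   // rule fires when `part` is NOT in this state
  std::string new_target_mode;  // mode to switch to when the rule fires
};

// Returns the target mode after evaluating the rules once: the new target of
// the first matching rule, or the unchanged target if no rule fires.
std::string select_target_mode(
  const std::string & target_mode,
  const std::map<std::string, std::string> & part_states,
  const std::vector<Rule> & rules)
{
  for (const Rule & rule : rules) {
    if (rule.if_target_mode != target_mode) {
      continue;  // rule is defined for a different target mode
    }
    const auto it = part_states.find(rule.part);
    // A part with unknown state is treated as not being in the required state.
    if (it == part_states.end() || it->second != rule.required_state) {
      return rule.new_target_mode;
    }
  }
  return target_mode;
}
```

With a single rule `{"NORMAL", "laser_driver", "active", "DEGRADED"}`, a `laser_driver` reported as `inactive` yields `DEGRADED`, while an `active` one leaves the target at `NORMAL`.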
Do you have suggestions for these times in the case of navigation2 @fmrico @marioney @jginesclavero @lbajo ?
From the metacontroller point of view, the reasoning cycle is very slow (about 2 sec), so we're safe with half of that, I guess. I'm not sure how that time affects navigation2, but I'm guessing it does not.
Closing this issue soon as it has successfully been shown in the MROS pilots.