Need a recovery program from a power cycle
A few times, an experiment has been prematurely halted due to a power cycle on a Pi. There are a few situations to consider:
- Power cycle on a leader that is also a worker.
- Power cycle on a worker only.
- Power cycle on both the leader and worker.
Ideally, after a power cycle the machine returns to the same (or nearly the same) state it was in before the cycle. This isn't always what we want, though, so maybe it should be configurable (e.g. if I pull the plug to stop some bad action, I don't want that action to start again when I turn the machine back on).
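Here is a minimal sketch of how the "configurable" part could work. It assumes a hypothetical `[recovery]` config section with a `restart_jobs_after_power_cycle` key (neither exists today). It also assumes the list of jobs that were running before the cycle was recorded somewhere. Defaulting to "don't restart" covers the "I pulled the plug on purpose" case.

```python
import configparser
from typing import List


def jobs_to_restart(config: configparser.ConfigParser, previously_running: List[str]) -> List[str]:
    """Decide which jobs to bring back after boot.

    Uses a hypothetical [recovery] section. If the section or key is missing,
    fall back to restarting nothing, so a deliberate plug-pull never resumes
    a bad action.
    """
    restart = config.getboolean("recovery", "restart_jobs_after_power_cycle", fallback=False)
    return list(previously_running) if restart else []


# Usage: parse the (hypothetical) setting and ask which jobs to restart.
config = configparser.ConfigParser()
config.read_string("[recovery]\nrestart_jobs_after_power_cycle = true\n")
print(jobs_to_restart(config, ["stirring", "od_reading"]))  # ['stirring', 'od_reading']
```

Keeping the decision in one small function means the same check can run on both leader and worker boots.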
Related to #107
Alternatively, there could be an alerting system that fires when the RPi cycles unexpectedly mid-experiment, so the user can intervene manually.
The current situation is not good: when the leader goes down, the broker loses all the LWT (last will and testament) messages (whyyyyy), so we need a way to correct this. Solutions:
- [x] `monitor` does a correction: it gets the latest experiment, looks for jobs with state="ready", and compares them against what is currently running on the machine. Any differences get marked "lost".
- [ ] `watchdog` does some sort of ping to all MQTT-active jobs?
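The `monitor` correction is essentially a set difference. The sketch below makes several assumptions. `JobRecord` is a hypothetical shape for a job's recorded state. The records are assumed to be already filtered to the latest experiment. The set of jobs actually running on the unit is assumed to come from elsewhere, for example the local process list.

```python
from dataclasses import dataclass, replace
from typing import List, Set


@dataclass(frozen=True)
class JobRecord:
    # Hypothetical shape of a recorded job state; the real storage may differ.
    unit: str
    job_name: str
    state: str


def correct_lost_jobs(records: List[JobRecord], unit: str, running: Set[str]) -> List[JobRecord]:
    """Mark jobs recorded as "ready" on `unit`, but not actually running there, as "lost"."""
    corrected = []
    for record in records:
        if record.unit == unit and record.state == "ready" and record.job_name not in running:
            corrected.append(replace(record, state="lost"))
        else:
            corrected.append(record)
    return corrected


# Usage with hypothetical unit names: only stirring survived the power cycle on worker1.
records = [
    JobRecord("worker1", "stirring", "ready"),
    JobRecord("worker1", "od_reading", "ready"),
    JobRecord("worker2", "stirring", "ready"),
]
for r in correct_lost_jobs(records, "worker1", running={"stirring"}):
    print(r.unit, r.job_name, r.state)
# worker1 stirring ready
# worker1 od_reading lost
# worker2 stirring ready
```

Each unit only corrects its own records, so this check can run locally on every machine after boot without needing the leader's broker state.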
Using a script like the one at the bottom of this page, https://ubuntuforums.org/showthread.php?t=1621039, would tell us whether the shutdown was graceful or not.
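The linked script may work differently. A common way to get the same signal is a sentinel file. The file is created at boot and removed only by a clean-shutdown hook, so finding it at the next boot means the previous shutdown was not graceful. This is a minimal sketch of that swapped-in technique. The function names and the sentinel location are hypothetical.

```python
import tempfile
from pathlib import Path


def previous_shutdown_was_graceful(sentinel: Path) -> bool:
    """Call at boot: a sentinel left over from the last boot means cleanup never ran."""
    return not sentinel.exists()


def mark_booted(sentinel: Path) -> None:
    """Create the sentinel right after the boot-time check."""
    sentinel.parent.mkdir(parents=True, exist_ok=True)
    sentinel.touch()


def mark_clean_shutdown(sentinel: Path) -> None:
    """Remove the sentinel from a shutdown hook; a power cut skips this step."""
    sentinel.unlink(missing_ok=True)  # missing_ok needs Python 3.8+


# Demo in a throwaway directory; a real deployment would use a fixed path (hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    sentinel = Path(tmp) / "unclean_shutdown"
    mark_booted(sentinel)
    # ...power is cut here, so mark_clean_shutdown never runs...
    print(previous_shutdown_was_graceful(sentinel))  # False
```

At boot, a `False` result is the point at which to raise the alert or run the `monitor` correction. The sentinel must be created only after the check, otherwise every boot would look unclean.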
> `monitor` does a correction: gets latest experiment, looks for jobs with state="ready", and compares against what is currently running on the machine. Differences get a "lost".
We removed this, but I think we should add this functionality back?
When a leader/worker cycles
Here's an idea that would help users avoid the "UI state doesn't match system state" problem caused by MQTT: don't persist MQTT messages to disk. Instead, identify which topics/messages currently need to be persisted, and move that data to another storage.
Not persisting messages means future developers would need to find another solution (like `local_persistant_storage`), which is a good pattern.
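One way to handle "move that data to another storage" is a small on-disk table on each unit, which survives a power cycle independently of the broker. This sketch uses the standard-library `sqlite3`. The `job_state` table and the helper functions are hypothetical, and this is not the API of `local_persistant_storage`.

```python
import sqlite3
from typing import Optional


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) a small table mapping job name -> state."""
    conn = sqlite3.connect(path)
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS job_state (job TEXT PRIMARY KEY, state TEXT NOT NULL)"
        )
    return conn


def set_state(conn: sqlite3.Connection, job: str, state: str) -> None:
    # `with conn` commits on success and rolls back on error.
    with conn:
        conn.execute("INSERT OR REPLACE INTO job_state (job, state) VALUES (?, ?)", (job, state))


def get_state(conn: sqlite3.Connection, job: str) -> Optional[str]:
    row = conn.execute("SELECT state FROM job_state WHERE job = ?", (job,)).fetchone()
    return row[0] if row else None


# Usage: ":memory:" keeps the demo self-contained; a real store would use a file path.
conn = open_store(":memory:")
set_state(conn, "stirring", "ready")
set_state(conn, "stirring", "lost")
print(get_state(conn, "stirring"))    # lost
print(get_state(conn, "od_reading"))  # None
```

Because the state lives on the unit itself, the UI (or `monitor`) can read it back after a power cycle instead of relying on messages the broker may have dropped.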