Prevent Node-RED from Entering Safe Mode After Multiple Restarts
Description
Currently, after multiple restarts, Node-RED automatically enters safe mode with the message: “Node-RED restart loop detected. Restarting in safe mode.” This can result in extended downtime during which flows are not running, particularly during off-peak hours such as overnight. This is problematic because critical services can remain offline for several hours.
Request: As an administrator, I would like the ability to configure a parameter (preferably as an environment variable) to prevent Node-RED instances from starting in safe mode after restarts. This will ensure that after a restart, flows remain active and the instance continues functioning without requiring manual intervention.
Expected Benefit: By introducing this configuration option, I will be able to minimize downtime and avoid lengthy periods where flows are not running, especially during unattended periods like overnight restarts. This ensures continuous service availability and improves overall system resilience.
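As a rough illustration of what the request amounts to (this is not an existing flag — the variable name, function, and threshold below are hypothetical), the launcher's restart handling could consult an environment variable before falling back to safe mode:

```javascript
// Hypothetical sketch only: neither FORGE_NR_NO_SAFE_MODE nor this function
// exists today; the threshold value is also an assumption.
const DEFAULT_MAX_RESTARTS = 5

function shouldEnterSafeMode (restartCount) {
    // Opt-out requested in this issue: keep flows running even after
    // repeated restarts when the administrator sets this variable.
    if (process.env.FORGE_NR_NO_SAFE_MODE === 'true') {
        return false
    }
    return restartCount >= DEFAULT_MAX_RESTARTS
}
```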
Which customers would this be available to
Team + Enterprise Tiers (EE)
Have you provided an initial effort estimate for this issue?
I am not a FlowFuse team member
@muenir thanks for raising this. Just to check the details here: we put an instance into safe mode when we've detected multiple hangs/restart loops, to prevent this continuing indefinitely.
Whilst we can offer a configuration option here to disable that (or give more fine-grained control over when safe mode is triggered), I'm struggling to see the value in turning it off entirely, as your application will just continue to crash/loop. Or do you expect it to auto-recover at some point?
I understand your point. In some cases, multiple restarts may happen due to temporary issues, such as memory issues/leaks. It’s also possible that a specific part of the flow is triggered at certain times, causing a restart (e.g., a buggy custom or function node). Since we’re already detecting restarts, we can respond to them more effectively. However, the extended downtime of flows, especially overnight, is a significant concern. Again, this would be just an optional and even temporary flag that would be set ...
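For context, the kind of detection being discussed can be sketched as a rolling window of restart timestamps with a configurable threshold; the class name, defaults, and window logic below are assumptions for illustration, not the launcher's actual implementation:

```javascript
// Illustrative only: a restart-loop detector with configurable limits.
class BootLoopDetector {
    constructor ({ maxRestarts = 5, windowMs = 5 * 60 * 1000 } = {}) {
        this.maxRestarts = maxRestarts // restarts that count as a loop
        this.windowMs = windowMs       // only restarts within this window count
        this.restarts = []
    }

    // Returns true when safe mode should be considered.
    recordRestart (now = Date.now()) {
        this.restarts.push(now)
        // Drop restarts that fall outside the rolling window, so a short
        // overnight burst of crashes does not trip safe mode permanently.
        this.restarts = this.restarts.filter(t => now - t <= this.windowMs)
        return this.restarts.length >= this.maxRestarts
    }
}
```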
Tasks
- [ ] Update FlowFuse to present and persist bootloop settings
- [ ] Update launcher to accept and use bootloop settings
- [ ] Update docker driver to get settings for project and pass to launcher
- [ ] Update k8s driver to get settings for project and pass to launcher
- [ ] Update local-fs driver to get settings for project and pass to launcher
@hardillb does this look about right (or can you think of a means of getting settings from platform to instance without having to alter all the drivers?)
@Steve-Mcl shouldn't need any changes to the drivers, all this should be handled by the nr-launcher and contained in the existing settings bundle
hmmm, but then how do we get new options (not yet defined in forge) across to the launcher? (sorry, I am clearly rusty in this area - pointers appreciated)
Instance/id/settings api
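To spell out that suggestion (the function shape and field names are illustrative assumptions, not FlowFuse's actual schema): the forge side folds the new bootloop settings into the bundle already returned by the instance settings API, so the launcher receives them without any driver changes:

```javascript
// Hypothetical sketch of the forge side merging bootloop settings into the
// existing settings bundle. Field names and defaults are assumptions.
function buildSettingsBundle (project, existingBundle) {
    return {
        ...existingBundle,
        bootloop: {
            // Persisted per-instance settings, falling back to safe defaults
            // so existing instances keep the current behaviour.
            enabled: project.settings?.bootloop?.enabled ?? true,
            maxRestarts: project.settings?.bootloop?.maxRestarts ?? 5
        }
    }
}
```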
Thanks Ben, you helped me find the code I was looking for - it was 10 lines below where I had already added the new variables! DOH!
Tasks
- [ ] Update FlowFuse to present and persist bootloop settings
- [ ] Update launcher to accept and use bootloop settings (default to existing constant)
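A minimal sketch of the launcher side of that second task, assuming hypothetical field names and stand-in constants (the real constant names may differ): missing settings fall back to the existing defaults, so behaviour is unchanged for instances that never set them.

```javascript
// Hypothetical sketch: resolve bootloop settings from the settings bundle,
// defaulting to the existing constants when they are absent.
const DEFAULT_MAX_RESTARTS = 5            // stand-in for the existing constant
const DEFAULT_RESTART_WINDOW_MS = 300000  // 5 minutes, assumed

function resolveBootloopSettings (bundle = {}) {
    const bootloop = bundle.bootloop || {}
    return {
        // Safe mode stays enabled unless explicitly turned off.
        enabled: bootloop.enabled !== false,
        maxRestarts: bootloop.maxRestarts ?? DEFAULT_MAX_RESTARTS,
        windowMs: bootloop.windowMs ?? DEFAULT_RESTART_WINDOW_MS
    }
}
```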
@joepavitt
This issue is scoped as
Which customers would this be available to
Team + Enterprise Tiers (EE)
However, that would make the implementation a lot more difficult and, TBH, I see no value in only supporting this option in Team + Enterprise.
Are you happy for me to scope this to "CE/EE"?
@Steve-Mcl can we not just hide the option based on a feature flag in the team type? We have a number of these options already
Sure, but I don't really see this as a "value add", and doing this just overcomplicates it IMO.
Also, I would need to either ignore the setting in the launcher (based on tier) or inhibit sending it to the launcher and default it to "off".
PS, this was not scoped by FF personnel. If I were the person scheduling this work, I would most likely have chosen CE/EE - so I just want clarification from Joe.
Scope changed to CE/EE in agreement with Joe.
Underlying tasks completed - closing issue