
Feature Request -- Safe Mode (Auto Rollback Changes if connection lost or not re-established within X time)

Open tcsi-github opened this issue 3 years ago • 41 comments

Important notices

Before you add a new report, we ask you kindly to acknowledge the following:

  • [x] I have read the contributing guide lines at https://github.com/opnsense/core/blob/master/CONTRIBUTING.md
  • [x] I am convinced that my issue is new after having checked both open and closed issues at https://github.com/opnsense/core/issues?q=is%3Aissue

I realize I am duplicating a post, however, I assumed when I created a reply it would re-open the closed issue. That appears not to be the case and I simply hoped that opening a new issue would generate more feedback than a closed one. Should the original issue be able to be re-opened I am happy to close this.

Is your feature request related to a problem? Please describe.

I have multiple sites where I am having to make several changes that could break access (network address changes, firewall changes, routing, etc.) ........ I'm scared. HAHA!

The remote sites have 0.75% technical experience. Yes, I might be able to find someone to power cycle a router. However, I've also worked as an MSP and had the resident "IT person" unplug the network cable to "reboot" which they thought worked because "the lights went off and came back on" (Link Lights) #facepalm

Needless to say, took me a bit to figure out that's what they were doing.

Describe alternatives you considered

To my knowledge OPNSense doesn't have any reasonable alternative to this particular problem. If I am wrong, my apologies for wasting time and please help direct me in a correct path.

Additional context

Originally posted by @tcsi-github in https://github.com/opnsense/core/issues/3042#issuecomment-1119297907

I also created a forum post regarding this but no reply as of yet. https://forum.opnsense.org/index.php?topic=28238.0

@banym had a good template as well and I include his here for comparison. Personally, I feel like the simpler thing would be to take a snapshot as soon as the breaking change option is enabled. I can see the use in giving the option for a timer, however, I would think setting a "middle-road" 120 seconds would be a default.

Describe the solution you'd like: It would be nice to lock the firewall in a "major change" mode where only one session is able to make changes until the major change mode is exited. This mode should be able to define a working configuration from the backup config history or the current configuration when this specific mode is activated. It should be possible to set a timer for change commitment. The administrator can then, for example, make significant changes to routing or rules that could possibly lock him out of the firewall. If he does not confirm that his change was successful and works as intended, the firewall rolls back to the defined configuration. This way the administrator can log in and "try" again or rethink his change.

Describe the solution you'd like

I was thinking about my MikroTik days and remember that if I happened to make an incorrect change, those changes would be reverted if I failed to connect back or didn't apply them within X time. Sometimes it was a pain, but it saved my butt so so many times. Kept me from having to make calls to talk someone through a reboot and just gave me a bit of room to breathe.

I would submit something like the below as a rough draft for an option.

The ability might stay disabled by default and only enabled prior to the change. This would allow the option to stay out of the way and only be used when explicitly needed.

@fichtner -- I believe this to be a worthwhile addition to the OS and would be a step in the direction of more enterprise use-cases. My development skills are limited, however, I would commit to time testing and anything within my abilities to help this become an actual feature.

I would welcome any thoughts or feedback on the suggestion.

Example Change ---> Admin has to change a WAN IP

[Image: breaking_change_enabled]

Scenario - Successful Change

  1. Admin logs in and enables "Breaking Changes Mode" (Or whatever better name)
    1. The working config is immediately snapshot.
    2. Notification alert is placed at the top bar
    3. No other logins are allowed except the current user's
    4. Working config is marked as the config to be used for restoration
    5. Perhaps the connection is marked for monitor?
    6. Timer Starts (120 second?)
    7. Notification Alert shows at top (See above)
  2. Admin makes changes just as they normally would
    • Complete with "Apply Changes", etc.
  3. Tests changes and is happy with results
    • Connection either not broken or reconnected within timer
  4. Admin disables "Breaking Changes Mode"
    • Previous snapshot config is then removed to prevent use on reboot.
    • System returns to normal operation
    • Notification removed from top bar
Now let's try a screw-up....

Scenario - Failed Change

  1. Admin logs in and enables "Breaking Changes Mode" as before
    1. Same as above happens to enable feature
  2. Admin makes changes just as they normally would
    • Complete with "Apply Changes", etc.
  3. Finds they have made a mistake and are unable to connect back to the GUI
    • Connection timer expires, or some other defined trigger
      • "Breaking Changes Mode" begins
        1. Previous Snapshot is automatically restored
        2. Router is Rebooted (if needed)
        3. Other actions such as logging the issue, etc...
  4. Admin waits for completion of restore - Perhaps an email or other notification could be sent notifying all is well again.
  5. Admin is able to start again, thankful they have not created a bigger problem requiring an on-site visit (For me, an EXPENSIVE multiple day trip)
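
The two scenarios above can be read as a small state machine. The following is only a minimal sketch of that logic in Python, not anything OPNsense ships: the class `BreakingChangeGuard`, the `Mode` names, and the idea of passing the clock in as `now` are all assumptions made for illustration, and the config is treated as an opaque string.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Mode(Enum):
    NORMAL = "normal"            # feature disabled, nothing pending
    ARMED = "armed"              # snapshot taken, timer running
    ROLLED_BACK = "rolled_back"  # timer expired, snapshot handed back


@dataclass
class BreakingChangeGuard:
    """Hypothetical model of the proposed "Breaking Changes Mode"."""
    timeout_s: float = 120.0
    mode: Mode = Mode.NORMAL
    snapshot: Optional[str] = None
    deadline: Optional[float] = None
    log: List[str] = field(default_factory=list)

    def enable(self, current_config: str, now: float) -> None:
        # Step 1 of both scenarios: snapshot the working config, start the timer.
        self.snapshot = current_config
        self.deadline = now + self.timeout_s
        self.mode = Mode.ARMED
        self.log.append("armed")

    def confirm(self, now: float) -> bool:
        # Successful change: the admin disables the mode before the deadline.
        if self.mode is not Mode.ARMED or now >= self.deadline:
            return False
        self.snapshot = None  # drop the snapshot so it is not reused later
        self.deadline = None
        self.mode = Mode.NORMAL
        self.log.append("confirmed")
        return True

    def tick(self, now: float) -> Optional[str]:
        # Failed change: once the deadline passes, return the config to restore.
        if self.mode is Mode.ARMED and now >= self.deadline:
            restored = self.snapshot
            self.snapshot = None
            self.deadline = None
            self.mode = Mode.ROLLED_BACK
            self.log.append("rolled back")
            return restored
        return None


if __name__ == "__main__":
    guard = BreakingChangeGuard(timeout_s=120)
    guard.enable("working-config", now=0.0)
    print(guard.tick(now=60.0))    # still inside the window
    print(guard.tick(now=121.0))   # window expired
    print(guard.log)
```

Passing `now` explicitly keeps the sketch testable without sleeping; a real implementation would also have to restart services or reboot after restoring, which is left out here.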

tcsi-github avatar May 06 '22 12:05 tcsi-github

Thank you for creating an issue. Since the ticket doesn't seem to be using one of our templates, we're marking this issue as low priority until further notice.

For more information about the policies for this repository, please read https://github.com/opnsense/core/blob/master/CONTRIBUTING.md for further details.

The easiest option to gain traction is to close this ticket and open a new one using one of our templates.

OPNsense-bot avatar May 06 '22 12:05 OPNsense-bot

I would love this in opnsense! I loved using commit confirmed 10 in Juniper.

aboutte avatar May 06 '22 15:05 aboutte

Same. I use reload in 5 often in switch or router configurations that are remote. Saves your butt when you misconfigured an ACL.

SCUR0 avatar May 06 '22 16:05 SCUR0

That's why auto sync is disabled when using a HA setup ;)

mimugmail avatar May 06 '22 17:05 mimugmail

@mimugmail , I can see that, however forgive me, are you saying that HA is the only option where this would be considered?

Apologies, sometimes I'm a bit dense. :-)

tcsi-github avatar May 06 '22 17:05 tcsi-github

Honestly I have no idea how to implement such a huge change in an easy way, the HA setup is already here, stable, easy to understand :)

mimugmail avatar May 06 '22 18:05 mimugmail

Forgive me if I come off wrong, that is not the intent. I'm not entirely sure how to answer this comment correctly.

While I agree that HA does have its place, I feel that, one, this is very unnecessary overhead you are suggesting, and two, it isn't exactly solving the same problem.

HA, to my understanding, requires at least duplicating hardware to achieve sync (I also thought it requires multiple public IPs but I'm not as sure about that). Apologies, but I'm not going to spend another $600+ for identical hardware and additional power consumption just for a misconfiguration. I would be time out of pocket, but I could justify a couple of "mistake trips" and it would still be cheaper than the cost of duplicate hardware.

HA solves for availability and redundancy, which granted, is covered by the solution. However, I am wanting a solution for recoverability only. If the hardware dies, fine.... in a HA setup I'd still have to drive out and replace the bad hardware, I would just still be running.

Also, I would like to understand the complexity of the request. I don't quite see how this is a huge change when the underpinnings are already there. We have the ability to backup / restore configs, I am simply suggesting a means of enabling a system that does it automatically. It could be expanded and made much more, but at the heart, that's what I think most everyone would agree they wanted. A save point and an auto restore if there was an issue.

I do realize there is probably more to consider so I would welcome someone helping me understand more. :-)

I hope that came off in the open spirit of debate rather than rude.

tcsi-github avatar May 06 '22 19:05 tcsi-github

There's just no generic concept of "configure and commit all pending changes" in a reliable way, which always will make such a feature incomplete and disappointing when really needed. You can in theory offer functionality like this on a per component level, which is also what we did for the firewall api (https://docs.opnsense.org/development/api/plugins/firewall.html#concept).

It's rather simple, if one could determine the conditions reliably (which I don't think one can knowing quite some different support scenarios), one could also build a plugin for it and open a PR for discussion.

A reliable failsafe, which would cover more different scenarios, would probably be to offer snapshots with zfs and go back in time when the user asks for it during boot. As this would also cover kernel/driver issues or software changes people forgot to act upon. Maybe that's something to look into for a future business release, you never know.

AdSchellevis avatar May 07 '22 07:05 AdSchellevis

I can understand your explanation of "Configure and Commit all pending changes". I also believe you are right in using ZFS snapshots for a more robust solution. However, what if a more simplistic approach was considered?

As I said before, we have the ability to backup "config.xml" as well as restore it again. We even have the "opnsense-importer" to restore the config file on boot I believe.

My understanding to this point is ..... If I have router 1 that dies, I can replace it with identical router 2 and restore with the config.xml from router 1 and, again assuming they are identical hardware, be right back ready to go. If this is correct, I believe we could reasonably be able to assume that the hardware isn't changing since the option would be more for a configuration error rather than a hardware change (which I believe even a ZFS snapshot would have issues with as well?)

Working from that, let us consider Juniper's "commit confirmed X" command

Per their site.

To confirm a commit, enter either a commit or commit check command.

If the commit is not confirmed within the time limit, the configuration rolls back automatically to the pre-commit configuration and a broadcast message is sent to all logged-in users. To show when a rollback is scheduled, enter the show system commit command. The allowed range is 1 through 65,535 minutes, and the default is 10 minutes.

This could look like this in OPNSense

  1. Admin enables "Rollback Option" with a timer of "5 minutes"
    • A backup "config.xml" is taken
    • A "Rollback Option Enabled - XX:XX Remaining" notification would appear in the top bar similar to my first post
  2. Admin does whatever is needed in the time frame
    • Commits are not tracked in any other way than normal
  3. Admin DOES stop timer within the given allotment
    • Timer is killed and no restore occurs leaving the system in the current state
  4. Admin DOES NOT stop the timer within the given allotment
    • Logins are paused to prevent changes during restoration
    • Automatic restoration of the previous created config.xml
    • The effect should theoretically be the exact same as if I did it manually.

Using this approach we are not storing changes for bulk commit or really changing the commit process at all..... we are simply saying "I'm about to do something I may not come back from, let me make a backup and set a timer to restore the backup if I don't finish in time"
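
As a rough illustration of "make a backup and set a timer to restore the backup", here is a minimal Python sketch. It is only a sketch under stated assumptions: the class `ConfigRollback` and the `.rollback` backup suffix are hypothetical names, the demo works on a temporary file rather than the real config.xml, and simply replacing the file would not by itself re-apply services on a live firewall.

```python
import shutil
import tempfile
import threading
from pathlib import Path


class ConfigRollback:
    """Hypothetical helper: back up a config file, restore it if not confirmed in time."""

    def __init__(self, config_path: Path, timeout_s: float):
        self.config_path = Path(config_path)
        # Backup lives next to the original; the suffix is an assumption.
        self.backup_path = self.config_path.with_name(self.config_path.name + ".rollback")
        self.timeout_s = timeout_s
        self.restored = threading.Event()
        self._lock = threading.Lock()
        self._timer = None

    def start(self) -> None:
        # Step 1: take the backup and start the countdown.
        shutil.copy2(self.config_path, self.backup_path)
        self._timer = threading.Timer(self.timeout_s, self._restore)
        self._timer.daemon = True
        self._timer.start()

    def stop(self) -> None:
        # Step 3: admin stopped the timer in time; keep the current state.
        if self._timer is not None:
            self._timer.cancel()
        with self._lock:
            self.backup_path.unlink(missing_ok=True)

    def _restore(self) -> None:
        # Step 4: timer expired; put the backed-up file back in place.
        with self._lock:
            if self.backup_path.exists():
                shutil.copy2(self.backup_path, self.config_path)
                self.backup_path.unlink()
                self.restored.set()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cfg = Path(tmp) / "config.xml"
        cfg.write_text("<opnsense>working</opnsense>")
        rollback = ConfigRollback(cfg, timeout_s=0.2)
        rollback.start()
        cfg.write_text("<opnsense>broken</opnsense>")  # the risky change
        rollback.restored.wait(timeout=2)              # admin never confirms
        print(cfg.read_text())
```

The lock only guards against `stop()` racing with an expiring timer; pausing logins during restoration, as described in step 4, is not modelled.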

Again, I do agree this could be built out more with ZFS in future. I will be the first to say I'm not familiar with the deep inner workings of the OS. I believe, however, I see most of the processes in place currently to make the above process work.

As always, I welcome the feedback and do appreciate the consideration. I lack developer skills, however, I am passionate about this project and feel this feature would be most welcome based on the feedback when I have posed the thought.

tcsi-github avatar May 07 '22 08:05 tcsi-github

I don't expect it will work, but as mentioned if it would, the plugin framework offers everything you need, so no need to keep this in core. you (or anyone else) could start working on such a plugin and open a PR there.

AdSchellevis avatar May 07 '22 09:05 AdSchellevis

My apologies for the incorrect placement. I should have realized this could be a plug in.

I appreciate the discussion very much.

tcsi-github avatar May 07 '22 09:05 tcsi-github

Chiming in to say that I would appreciate this feature too. It’d be much nicer to have a simple “wait 20 secs after lights stop blinking”, than (in my case) connecting to the same LAN as the Proxmox server virtualizing OPNsense (rather than my daily VLAN), reverting a snapshot or typing in the command line to undo the changes, and possibly reboot if needed.

Is there some recognizable way for other users to express interest without butting in the middle of implementation discussions?

JJGadgets avatar May 07 '22 09:05 JJGadgets

My apologies for the incorrect placement. I should have realized this could be a plug in.

No problem at all, it's easy to overlook. Don't mind keeping the ticket open for now, just mentioning it's (currently) not a core priority and someone will have to do some work at some point in time in order to mature ideas. When keeping it simple, a plugin would probably be the better place anyway.

AdSchellevis avatar May 07 '22 09:05 AdSchellevis

@JJGadgets Thank you for the interest! I don't believe anyone would view it as butting in. @AdSchellevis Thank you for leaving the request open. I understand there are bigger things than my request. 🙂 I will create a request in plug-ins and reference here.

tcsi-github avatar May 07 '22 09:05 tcsi-github

@tcsi-github let me move this then

AdSchellevis avatar May 07 '22 09:05 AdSchellevis

Thank you sir!

tcsi-github avatar May 07 '22 09:05 tcsi-github

I may look into implementing this for OPNsense nodes on ZFS filesystems, but no guarantees. Will update if I decide to take this on

CorvetteCole avatar Jul 08 '22 15:07 CorvetteCole

I've been using zfs with snapshots in a virtualized environment for quite some time to rollback if an update failed. Would love to see that in the opnsense GUI as well - either with some kind of timer (roll back in 5 minutes if not stopped in time) or completely manually (like to boot and check out different versions).

In the shell it works. Installed opnsense 22.1.2 on a Sophos appliance on top of zfs. Prior to an update I ran a zfs snapshot zroot/ROOT/default. After that it's a bit messy, but what I did is create a clone from the snapshot, and I could then switch between the original root zroot/ROOT/default and the cloned one zroot/ROOT/clone by setting zpool set bootfs=zroot/ROOT/default zroot or zpool set bootfs=zroot/ROOT/clone zroot. Just reboot and the old/new root filesystem gets mounted.
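
For reference, the sequence described above might look like the following when assembled from Python. This is a sketch only: it builds and prints the command lines instead of running them, the helper names are hypothetical, and the snapshot name `pre_update` is an assumption since the comment does not give one. Mount properties that a clone may need adjusted are not covered.

```python
import shlex
from typing import List


def snapshot_and_clone_cmds(root: str, snap: str, clone: str) -> List[List[str]]:
    """Commands to snapshot the root dataset and clone that snapshot."""
    return [
        ["zfs", "snapshot", f"{root}@{snap}"],
        ["zfs", "clone", f"{root}@{snap}", clone],
    ]


def set_bootfs_cmd(pool: str, dataset: str) -> List[str]:
    """Command to select which dataset is mounted as root on next boot."""
    return ["zpool", "set", f"bootfs={dataset}", pool]


if __name__ == "__main__":
    # "pre_update" is a made-up snapshot name for illustration.
    cmds = snapshot_and_clone_cmds("zroot/ROOT/default", "pre_update", "zroot/ROOT/clone")
    cmds.append(set_bootfs_cmd("zroot", "zroot/ROOT/clone"))
    for cmd in cmds:
        print(shlex.join(cmd))  # dry run: print instead of executing
```

Keeping the commands as argument lists means they could later be handed to `subprocess.run` without shell quoting concerns.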

zfs clones have some disadvantage if one wanted to keep the snapshots/clones for a longer time, so a zfs send/receive might be a better approach. Nice would be an integration into the GUI and the boot loader - like to choose the zfs dataset to be next mounted as root.

From quick testing the only interesting zfs dataset to be snapshotted is zroot/ROOT/default. There are some more (/usr, /var etc.). Some time in the future it might be necessary to snapshot them as well but then it would be nicer if all datasets would not be set up like

zroot
zroot/ROOT
zroot/ROOT/default
zroot/tmp
zroot/usr
zroot/usr/home
zroot/usr/ports
zroot/usr/src
zroot/var
zroot/var/audit
zroot/var/crash
zroot/var/log
zroot/var/mail
zroot/var/tmp

but something like

zroot
zroot/update_22_1_10
zroot/update_22_1_10/ROOT
...
zroot/prod
zroot/prod/ROOT
zroot/prod/ROOT/default
zroot/prod/tmp
zroot/prod/usr
zroot/prod/usr/home
zroot/prod/usr/ports
zroot/prod/usr/src
zroot/prod/var
zroot/prod/var/audit
zroot/prod/var/crash
zroot/prod/var/log
zroot/prod/var/mail
zroot/prod/var/tmp

It's much easier then to snapshot all relevant datasets by e.g. zfs snapshot -r zroot/prod@update_22_1_10 and clone them into zroot/update_22_1_10. I tried that out manually as well by renaming the datasets from zroot/... to zroot/prod/... and it worked as well.
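
To make the renaming idea concrete, here is a small sketch of how the clone targets could be derived from the dataset names. The function `clone_plan` is hypothetical and only produces command lines; it assumes one recursive snapshot followed by one `zfs clone` per dataset, with parents created before children.

```python
from typing import Iterable, List


def clone_plan(datasets: Iterable[str], src_root: str, dst_root: str, snap: str) -> List[List[str]]:
    """Build a recursive snapshot plus one clone command per dataset under src_root."""
    cmds = [["zfs", "snapshot", "-r", f"{src_root}@{snap}"]]
    # Sorting puts a parent before its children, since a prefix sorts first.
    for ds in sorted(datasets):
        if ds != src_root and not ds.startswith(src_root + "/"):
            continue  # dataset is outside the subtree being cloned
        target = dst_root + ds[len(src_root):]
        cmds.append(["zfs", "clone", f"{ds}@{snap}", target])
    return cmds


if __name__ == "__main__":
    existing = [
        "zroot/prod",
        "zroot/prod/ROOT",
        "zroot/prod/ROOT/default",
        "zroot/prod/usr",
        "zroot/prod/usr/home",
        "zroot/prod/var",
        "zroot/prod/var/log",
    ]
    for cmd in clone_plan(existing, "zroot/prod", "zroot/update_22_1_10", "update_22_1_10"):
        print(" ".join(cmd))
```

With the flat layout shown first, the same function would need a list of roots rather than a single `src_root`, which is part of why the nested layout is easier to work with.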

If something breaks it's also possible to boot from an opnsense usb installer thumb drive, issue a zpool import -f zroot and clone/rollback/set/whatever the zfs datasets from within the live system.

I think zfs snapshots might be a way to roll back from a failed update or a misconfiguration.

Happy to discuss this further.

spi43984 avatar Jul 08 '22 16:07 spi43984

yeah I figured ZFS snapshots would be the way to go. I've heard that UFS supports snapshots as well, but don't know much about that

CorvetteCole avatar Jul 08 '22 18:07 CorvetteCole

Anyone experienced in writing plugins for opnsense? Could try and create something together for zfs snapshots and rollbacks.

spi43984 avatar Jul 08 '22 19:07 spi43984

I think the first step would be a plugin that you can interact with in the UI to manually snapshot and restore snapshots. Once we've got that, hooking it in to system events should be a little less... painful? I have no experience with OPNsense plugins but will look around and see if I can get some knowledge to start building a base for that or something

CorvetteCole avatar Jul 08 '22 19:07 CorvetteCole

see https://github.com/opnsense/plugins/tree/master/devel/grid_example and https://github.com/opnsense/plugins/tree/master/devel/helloworld. I'm poking around those rn

CorvetteCole avatar Jul 08 '22 19:07 CorvetteCole

Yep, that sounds like a good way to start.

spi43984 avatar Jul 08 '22 19:07 spi43984

Looks like we can write the 'backend' part of this in python

CorvetteCole avatar Jul 08 '22 19:07 CorvetteCole

and here are some more docs. I'll see if I can create a bit of a template interface for this, give me a bit https://docs.opnsense.org/development/examples/helloworld.html

CorvetteCole avatar Jul 08 '22 19:07 CorvetteCole

If the backend is in python - does it have an API for zfs or do we need to call zfs shell commands?

spi43984 avatar Jul 08 '22 19:07 spi43984

I suspect we'll have to use zfs shell commands, even if there are python libraries for interacting with zfs. see here for configd which we would probably use to do this: https://docs.opnsense.org/development/backend/configd.html.

It would be nice to use a pip package, unsure on what we are allowed to do with that
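
If the backend does end up shelling out to zfs, a script that configd runs could be as small as the sketch below. The function names are hypothetical, nothing here is taken from the plugin repository, and it assumes `zfs list -H -p` prints tab-separated rows with the creation time as an integer timestamp.

```python
import subprocess
from typing import List, Tuple


def parse_snapshot_list(output: str) -> List[Tuple[str, str, int]]:
    """Parse 'zfs list -H -p -t snapshot -o name,creation' output into tuples."""
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue
        name, creation = line.split("\t")
        dataset, _, snap = name.partition("@")
        rows.append((dataset, snap, int(creation)))
    return rows


def list_snapshots() -> List[Tuple[str, str, int]]:
    """Query zfs for snapshots; requires the zfs binary, so it is not called below."""
    result = subprocess.run(
        ["zfs", "list", "-H", "-p", "-t", "snapshot", "-o", "name,creation"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_snapshot_list(result.stdout)


if __name__ == "__main__":
    # Made-up sample output so the parser can be tried without a zfs pool.
    sample = "zroot/ROOT/default@pre_update\t1657300000\n"
    for dataset, snap, created in parse_snapshot_list(sample):
        print(dataset, snap, created)
```

Separating the parser from the subprocess call keeps the parsing testable on machines without ZFS.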

CorvetteCole avatar Jul 08 '22 19:07 CorvetteCole

I am going to try to develop an interface for the manual component of this here: https://github.com/CorvetteCole/opnsense-plugins-snapshot

would be nice to have an experienced plugin dev help create the interface though. Will update this issue with info once I get this package sort of set up

CorvetteCole avatar Jul 08 '22 19:07 CorvetteCole

here it is on my branch. Calling it opn-snapshot for now but none of this is set in stone. I'll try to work out a front-end when I have time. I also know how we will implement automatic rollbacks although I think it will only be able to recover if the system boots successfully, unsure if we will be able to save from failed upgrades.

I've added you as a collaborator on the repository so you are welcome to work on what you find interesting as well. We need to figure out some sort of framework for how we will call for zfs snapshot restore and creation, as well as ways to query what snapshots are available etc. Obviously some thought to be done

https://github.com/CorvetteCole/opnsense-plugins-snapshot/tree/master/sysutils/opn-snapshot

CorvetteCole avatar Jul 08 '22 20:07 CorvetteCole

I also know how we will implement automatic rollbacks although I think it will only be able to recover if the system boots successfully, unsure if we will be able to save from failed upgrades.

That's why I raised the question about adding this to the boot menu as well. If opnsense does not come up properly one could choose another dataset as root from the boot menu.

We need to figure out some sort of framework for how we will call for zfs snapshot restore and creation, as well as ways to query what snapshots are available etc. Obviously some thought to be done

We might want to talk to the opnsense-core guys to change the zfs hierarchy as depicted in my comment earlier today. Would make finding the right datasets easier.

spi43984 avatar Jul 08 '22 20:07 spi43984