Container snapshots/restore feature and other critical bug fixes
PR Summary
This PR introduces a feature to snapshot/restore the sandbox storage + several important fixes and features to enhance the overall functionality of the system.
Key Changes:
- Snapshot/restore of container storage feature added, resolving https://github.com/All-Hands-AI/OpenHands/issues/6163
- Fix for parsing config
docker_runtime_kwargs = { privileged = true }https://github.com/All-Hands-AI/OpenHands/issues/5569
New Feature: Container Snapshot and Restore
This feature introduces the ability to snapshot and restore container storage. It mounts Docker's storage directory (/var/lib/docker) on a virtual Btrfs volume using a loop device within the OpenHands (OH) environment. This approach has been tested on Windows using VS Code and Dev Containers, with OH running in a Docker container and creating sandboxes in nested containers.
Motivation:
The primary goal of this feature is to enable reverting changes made by OH across the entire system. This allows users to effectively undo any system modifications introduced during interactions within OH, promoting a safer and more predictable development and testing environment.
Requirements
- The host must have btrfs support (check
/proc/filesystems). Your filesystem can be any type though.
Important Notes:
- The workspace is mounted separately and is not included in the container storage, meaning it is excluded from snapshots. Since the workspace is shared across containers or conversations, it would not be appropriate to roll back changes that affect other sessions or containers. See: https://github.com/All-Hands-AI/OpenHands/issues/6457.
Current Status:
- No additional features will be added at this moment.
- Frontend developers are welcome to implement a button to allow users to revert snapshots (one per user message).
Usage Instructions:
To enable the snapshot and restore functionality, add the following configuration to your sandbox settings:
[sandbox]
docker_snapshots = true
If disabled, the Docker configuration will revert upon restarting OH.
Snapshot Creation:
- Every time a user sends a message, a new snapshot of the container storage will be created.
- The logs will display the command required to restore each snapshot.
Testing
- The file
openhands/storage/docker_snapshots.pycontains a test case (unit-tests in comments) to debug and understand how it works.
Feel free to reach out if there are any issues or further questions.
This is a great feature, thank you for working on it!
It might be good to try to keep the bug fixes separate from new feature(s). One smaller fixes PR and one large feature PR would mean that the fixes PR is faster to review and merge. If you need the fixes too, you can create/rebase the feature branch on top on the fixes branch. It shouldn't cause much trouble when merged, then.
Up to you, it just would be better IMHO. Because I think you are correct that the fixes are needed (independently), so it would be nice if we can take them in before the snapshots feature is ready.
It might be good to try to keep the bug fixes separate from new feature(s).
Here I isolated the critical bug fixes: https://github.com/All-Hands-AI/OpenHands/pull/6460
Let's remove the dev_container stuff from this PR to keep changes isolated.
Feel free to open a second PR for dev_container, but no guarantees there. The maintainers explicitly chose to remove dev_container, and I don't want to add it back without gathering consensus first
I tried this on mac and it doesn't really work, and it looks like it now requires sudo root privileges, which feels unsafe to give to an app.
On a real deployment (SaaS in kubernetes, on-prem corporate, etc..) that serves multiple users, this would need a significantly different implementation (support from the runtime infra itself).
What's the plan for making this actually work in a real production deployment?
I tried this on mac and it doesn't really work, and it looks like it now requires
sudoroot privileges, which feels unsafe to give to an app.
What exactly didn't work? any error msg?
You can try the test case commented here: https://github.com/All-Hands-AI/OpenHands/pull/6441/files#diff-56f99ca967ee59d7ce3134b274798e48f4fce237dce9c8dd6d2b3c419781fb28R3
and it looks like it now requires
sudoroot privileges, which feels unsafe to give to an app.
I'm running OH inside a container and sandboxes as nested containers (and sometimes those sandboxes create nested-nested-containers). I don't understand why people is running OH on the host. Probably not windows users.
Docker runs as root on my OH env. If you want to restart a sandbox using a restored snapshot, you will need root access (sudo). Alternatively, you could run the commands manually yourself, or setup your environment to run docker as a normal user, but I'm not supporting this use case.
support from the runtime infra itself
The runtime operates within the sandbox. While the runtime can trigger the snapshot/restore operation, the actual handling must occur outside the sandbox. Otherwise, you won't be able to capture the entire sandbox storage during the snapshot.
What's the plan for making this actually work in a real production deployment?
You mean running OH in production like a "OH as a service"? Or you mean running a sandbox in production? (Using "OH" and "in production" in the same sentence still sound a bit strange to me).
We are snapshotting sandboxes (docker containers) and docker containers are widely used in production. If you are interested in taking and restoring snapshots on a bare metal host or VM, the same implemented strategy could be applied, but instead of taking a snapshot of the sandbox's container storage, we can take a snapshot of a mounted Btrfs volume. And as I mentioned before, this wouldn't be a complete snapshot, only of the designated mounted volume.
Related: For https://github.com/All-Hands-AI/OpenHands/issues/6457 we will probably need to snapshot a designated workspace. This is another additional requirement that might also align well with your use case.
I tried this on mac and it doesn't really work,
Make sure to chmod +x openhands/storage/docker_snapshots.py
Fixed in https://github.com/All-Hands-AI/OpenHands/pull/6441/commits/780acd08c56a089ce8faab696ee5509c48a5c79a
@kripper I think we need to break this into multiple PRs
The reattach on ExposedPorts is a simple bug, we should be able to merge that fix quickly
The snapshot stuff seems super cool, would love to get that in
The devcontainer stuff is controversial, and I'd love to have a separate PR where we can discuss whether or not to merge that
Oh nice looks like a good chunk of it was broken out here! https://github.com/All-Hands-AI/OpenHands/pull/6460/files 🙏
Ok. I will remove the DevContainer folder from this PR in the future. My expectation was the other PR to be merged and so to avoid the additional effort of preparing PRs with cherry picked commits.
This is a neat idea, but I feel like for a multi-runtime and cross-platform (Mac, Linux, Windows), the sandbox snapshot & restore capability should be:
- be introduced at the top Runtime level, so that it's supported on all types of deployments (local and Kubernetes)
- have a common denominator base implementation that works across all platforms, using more basic approaches such as raw
cp -rorrsync, which works on all OS - each specific runtime, if they have a more specific implementation, can override and do something bespoke
The way this PR is implemented right now, it's too focused on Linux kernel with root + btrfs support, whereas that's not the case for Mac OSX docker engine, and not the case for Google Cloud Kubernetes engine. The requirement to have sudo and install system level packages in the host machine is a security concern.
I think OpenHands maintainers should discuss whether snapshot & restore is a prioritized feature, and if it is slotted onto the roadmap, discuss how to have generally usable implementation of it that isn't so kernel & btrfs dependent.
My recommendation is to rely on user-space tooling like cp -r, rsync and hardlinks and such that are user-space and broadly supported by all runtimes.
This is a neat idea, but I feel like for a multi-runtime and cross-platform (Mac, Linux, Windows), the sandbox snapshot & restore capability should be:
- be introduced at the top Runtime level, so that it's supported on all types of deployments (local and Kubernetes)
- have a common denominator base implementation that works across all platforms, using more basic approaches such as raw
cp -rorrsync, which works on all OS- each specific runtime, if they have a more specific implementation, can override and do something bespoke
The way this PR is implemented right now, it's too focused on Linux kernel with root + btrfs support, whereas that's not the case for Mac OSX docker engine, and not the case for Google Cloud Kubernetes engine. The requirement to have
sudoand install system level packages in the host machine is a security concern.I think OpenHands maintainers should discuss whether snapshot & restore is a prioritized feature, and if it is slotted onto the roadmap, discuss how to have generally usable implementation of it that isn't so kernel & btrfs dependent.
My recommendation is to rely on user-space tooling like
cp -r,rsyncand hardlinks and such that are user-space and broadly supported by all runtimes.
I agree most of this comments. Not only the Mac and gcp, I remember currently openhands still have some problem on Windwos without WSL. I agree that we should discuss whether snapshot & restore is a prioritized feature.
This PR is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This PR was closed because it has been stalled for over 30 days with no activity.