qubes-mirage-firewall
mirage vm crash on downstream vm start/stop
This is a somewhat outdated build (build-with-docker is currently failing for me), and I cannot find any closed issue that sounds related, so this may still be unfixed.
The mirage firewall VM sometimes exits on downstream VM start/stop, with errors like this:
2018-11-04 01:19:13 -00:00: INF [client_net] add client vif {domid=462;device_id=0}
Fatal error: exception Xs_protocol.Enoent("unwatch")
Raised at file "map.ml", line 122, characters 10-25
Called from file "src0/sexp_conv.ml", line 149, characters 10-37
Mirage exiting with status 2
Do_exit called!
I haven't seen that before. The error is coming from the XenStore client's unwatch
function. The firewall doesn't call that directly, but it does call XenStore's wait
function, which uses it:
https://github.com/mirage/ocaml-xenstore/blob/master/client_lwt/xs_client_lwt.ml#L394
That calls it from an async
block, so there's no easy way for the firewall code to catch the exception.
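To make the control flow concrete, here is a minimal self-contained sketch (plain Lwt, not the actual xs_client_lwt code): an exception raised inside Lwt.async never reaches the caller's Lwt.catch. It goes to Lwt.async_exception_hook instead, whose default handler prints "Fatal error: exception ..." and exits with status 2, which matches the log above.

  (* Sketch only: models the async block inside XenStore's wait.  The
     failure below stands in for the unwatch call raising Enoent. *)
  let wait_like () =
    Lwt.async (fun () -> Lwt.fail (Failure "Enoent(\"unwatch\")"));
    Lwt.return_unit

  let () =
    Lwt_main.run
      (Lwt.catch wait_like
         (fun _ -> print_endline "never reached"; Lwt.return_unit))
    (* prints "Fatal error: exception Failure(...)" and exits with
       status 2 via the default Lwt.async_exception_hook; the catch
       handler never runs *)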
It's probably worth opening a bug there, giving the error you see and the version of xenstore you're using (opam info xenstore
in the environment in which you built the firewall should show this).
I guess the problem is that it tries to stop watching a store entry when the VM goes away or changes its network settings, and gets an error because the thing it was watching has already gone away. I don't know enough about how XenStore watches work to know the correct fix. Possibly the xenstore client should just ignore the exception, assuming the server has already stopped watching something that no longer exists.
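Something along those lines (an illustrative sketch that compiles against lwt and xenstore's Xs_protocol, not a patch for the real library; unwatch here stands for whatever function performs the actual unwatch request):

  (* Swallow only Enoent, on the assumption that the path, and with it
     the watch, is already gone; re-raise everything else. *)
  let unwatch_ignoring_enoent unwatch path token =
    Lwt.catch
      (fun () -> unwatch path token)
      (function
        | Xs_protocol.Enoent _ -> Lwt.return_unit
        | e -> Lwt.fail e)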
AFAICT this still happens with a 20190416 build (aka "beyond the ICMP issue").
I moved the backup VM (which gets restarted hourly) to a separate mirage-firewall instance some months ago, and it looks like that backup-mfwt is still exploding multiple times per week.
I've made a test branch with a version of XenStore that logs this error instead of crashing.
To test:
git clone 'https://github.com/talex5/qubes-mirage-firewall.git' -b unwatch-crash
cd qubes-mirage-firewall/
./build-with-docker.sh
Then deploy as usual. On start-up, this version should log:
2019-11-22 11:36:15 -00:00: INF [xenstore.client] ** XenStore with unwatch patch **
If it still crashes, it should crash with a different error, showing the real problem. If it no longer crashes, please check the logs after a while for ERR
messages.
Built and deployed on two systems in a role that triggers this bug 2-3 times a week (a firewall for a backup-sync VM that gets started/stopped once per hour).
2019-12-17 04:36:41 -00:00: INF [xenstore.client] ** XenStore with unwatch patch **
....
2019-12-19 17:33:22 -00:00: INF [net-xen backend] Frontend asked to close network device dom:3167/vif:0
2019-12-19 17:33:22 -00:00: INF [client_net] client {domid=3167;device_id=0} has gone
2019-12-19 18:25:56 -00:00: INF [client_net] add client vif {domid=3168;device_id=0} with IP 10.137.0.26
2019-12-19 18:25:56 -00:00: ERR [xenstore.client] Error unwatching "/local/domain/3168/device/vif/0/state": Xs_protocol.Enoent("unwatch")
Fatal error: exception Xs_protocol.Error("E2BIG")
Raised at file "map.ml", line 135, characters 10-25
Called from file "sexp_conv.ml", line 147, characters 10-37
Mirage exiting with status 2
Do_exit called!
Ah, OK! This is a quota-exceeded error from XenStore in dom0 because we have too many watches.
The problem seems to be that Qubes does not completely clean up domains from XenStore when they are shut down. For example, repeatedly starting and shutting down a Linux domain connected directly to sys-net leaves a trail of empty /local/domain/$SYSNET/backend/vif/*
directories in XenStore, e.g.:
dom0$ xenstore-ls /local/domain/1/backend/vif
[...]
40 = ""
41 = ""
42 = ""
43 = ""
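(As a side note, and purely my own suggestion rather than something from this thread: such stale entries can be removed by hand in dom0, e.g. xenstore-rm /local/domain/1/backend/vif/40.)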
Because these directories still exist, mirage-firewall looks inside them to get the network settings. It doesn't find anything, but because it does this in the watch loop it still adds these old paths to the watch list and tries to monitor them. The firewall could use a different XenStore handle for this so that the XenStore client doesn't try to monitor them, but I think we do actually want to notice if a domain has an interface added or removed.
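A self-contained model of that failure mode (illustrative only, not the firewall's actual code): every read performed inside the watch loop also registers a watch, so re-scanning the stale vif directories makes the watch set grow without bound until dom0's quota is exceeded, producing the E2BIG above.

  (* Model: a watch loop that adds every path it reads to a watch set. *)
  module PathSet = Set.Make (String)

  let watched = ref PathSet.empty

  (* stand-in for a read inside the watch loop: as a side effect, the
     path is added to the watch list, like reads under XenStore's wait *)
  let read_in_watch_loop path =
    watched := PathSet.add path !watched;
    None  (* the stale directories hold no network settings *)

  let scan_vifs ids =
    List.iter
      (fun id ->
         ignore (read_in_watch_loop (Printf.sprintf "backend/vif/%d" id)))
      ids

  let () =
    (* domains 40-43 are gone, but their empty dirs remain and get watched *)
    scan_vifs [40; 41; 42; 43];
    Printf.printf "watches held: %d\n" (PathSet.cardinal !watched)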
I've made a PR to xenstore so that it should report quota errors sensibly: https://github.com/mirage/ocaml-xenstore/pull/47
Around 1.5 years later, I cut a release of ocaml-xenstore with @talex5's fix. I'm not sure whether this actually fixes the issue or just reports the error more nicely. Also, now that the Qubes issue linked above has been fixed, is this still an issue, @xaki23?
Closing; please re-open if the issue persists in 0.8.0 (or 0.8.1, which will be ready in a minute).
This is still an issue, I've heard.
Re-reported with a fresh backtrace in #155.