qubes-mirage-firewall icon indicating copy to clipboard operation
qubes-mirage-firewall copied to clipboard

mirage vm crash on downstream vm start/stop

Open xaki23 opened this issue 5 years ago • 8 comments

somewhat outdated build, but build-with-docker failing currently for me, and i can not find any closed issue that sounds related, so may be unfixed yet.

mirage fw vm sometimes exits on downstream vm start/stop, with errors like this:


2018-11-04 01:19:13 -00:00: INF [client_net] add client vif {domid=462;device_id=0} Fatal error: exception Xs_protocol.Enoent("unwatch") Raised at file "map.ml", line 122, characters 10-25 Called from file "src0/sexp_conv.ml", line 149, characters 10-37 Mirage exiting with status 2 Do_exit called!

xaki23 avatar Nov 04 '18 13:11 xaki23

I haven't seen that before. The error is coming from the XenStore client's unwatch function. The firewall doesn't call that directly, but it does call XenStore's wait function, which uses it:

https://github.com/mirage/ocaml-xenstore/blob/master/client_lwt/xs_client_lwt.ml#L394

That calls it from an async block, so there's no easy way for the firewall code to catch the exception.

It's probably worth opening a bug there, giving the error you see and the version of xenstore you're using (opam info xenstore in the environment in which you built the firewall should show this).

I guess the problem is that it tries to stop watching a store entry when the VM goes away or changes its network settings, and gets an error because the thing it was watching has already gone away. I don't know enough about how XenStore watches work to know the correct fix. Possibly the xenstore client should just ignore the exception, assuming the server has already stopped watching something that no longer exists.

talex5 avatar Nov 04 '18 14:11 talex5

afaict this still happens with a 20190416 build (aka "beyond the icmp issue").

i moved the backup vm (which gets restarted hourly) to a separate mirage-firewall instance some months ago, and it looks like that backup-mfwt is still exploding multiple times per week.

xaki23 avatar Jul 18 '19 23:07 xaki23

I've made a test branch with a version of XenStore that logs this error instead of crashing.

To test:

git clone 'https://github.com/talex5/qubes-mirage-firewall.git' -b unwatch-crash
cd qubes-mirage-firewall/
./build-with-docker.sh

Then deploy as usual. On start-up, this version should log:

2019-11-22 11:36:15 -00:00: INF [xenstore.client] ** XenStore with unwatch patch **

If it still crashes, it should crash with a different error, showing the real problem. If it no longer crashes, please check the logs after a while for ERR messages.

talex5 avatar Nov 22 '19 11:11 talex5

built and deployed on two systems in a role that triggers this bug 2-3 times a week. (fw for backup-sync vm that gets started/stopped once per hour)

xaki23 avatar Dec 17 '19 18:12 xaki23

2019-12-17 04:36:41 -00:00: INF [xenstore.client] ** XenStore with unwatch patch ** .... 2019-12-19 17:33:22 -00:00: INF [net-xen backend] Frontend asked to close network device dom:3167/vif:0 2019-12-19 17:33:22 -00:00: INF [client_net] client {domid=3167;device_id=0} has gone 2019-12-19 18:25:56 -00:00: INF [client_net] add client vif {domid=3168;device_id=0} with IP 10.137.0.26 2019-12-19 18:25:56 -00:00: ERR [xenstore.client] Error unwatching "/local/domain/3168/device/vif/0/state": Xs_protocol.Enoent("unwatch") Fatal error: exception Xs_protocol.Error("E2BIG") Raised at file "map.ml", line 135, characters 10-25 Called from file "sexp_conv.ml", line 147, characters 10-37 Mirage exiting with status 2 Do_exit called!

xaki23 avatar Dec 19 '19 18:12 xaki23

Ah, OK! This is a quota-exceeded error from XenStore in dom0 because we have too many watches.

The problem seems to be that Qubes does not completely clean up domains from XenStore when they are shut down. Even e.g. repeatedly starting and shutting down a Linux domain connected directly to sys-net leaves a trail of empty /local/domain/$SYSNET/backend/vif/* directories in XenStore. e.g.

dom0$ xenstore-ls /local/domain/1/backend/vif
[...]
40 = ""
41 = ""
42 = ""
43 = ""

Because these directories still exist, mirage-firewall looks inside them to get the network settings. It doesn't find anything, but because it does this in the watch loop it still adds these old paths to the watch list and tries to monitor them. The firewall could use a different XenStore handle for this so that the XenStore client doesn't try to monitor them, but I think we do actually want to notice if a domain has an interface added or removed.

talex5 avatar Dec 22 '19 13:12 talex5

I've made a PR to xenstore so that it should report quota errors sensibly: https://github.com/mirage/ocaml-xenstore/pull/47

talex5 avatar Nov 25 '20 18:11 talex5

Around 1.5 years later, I cut a release of ocaml-xenstore with @talex5 fix. I'm not sure whether this actually fixes this issue, or just reports the error more nicely. Also, with the qubes issue linked above that is fixed, is this still an issue @xaki23?

hannesm avatar Jul 26 '22 19:07 hannesm

closing, please re-open if the issue persists in 0.8.0 (or the 0.8.1 which is ready in a minute)

hannesm avatar Sep 14 '22 09:09 hannesm

This is still an issue, I've heard.

hannesm avatar Oct 27 '22 18:10 hannesm

re-reported with a fresh backtrace in #155

hannesm avatar Oct 27 '22 19:10 hannesm