split-gpg2 VM hangs on signing too many commits in a row
Qubes OS release
Qubes OS 4.2
Brief summary
Rebasing a large branch (signing 70+ commits) can make the split-gpg2 VM freeze. This possibly happens because all the swap space fills up after a while (see screenshot below).
Once all swap space is full, the VM hangs for a while. After a few minutes some "gpg access granted" notifications are shown late.
The split-gpg2 VM has to be restarted/killed to continue committing.
Steps to reproduce
- Rebase a branch with many commits (signing required of course).
- Observe ~~split-gpg2~~ notification-daemon uses up all the memory/swap without cleaning up.
- VM hangs
Expected behavior
Split-gpg2 is able to sign all commits, even with only 1 GB of swap + 500 MB of memory. The notification-daemon doesn't consume all the memory.
Actual behavior
VM freezes:
htop at the time of freeze:
Additional information
No response
What was using all the memory?
Given memory consumption you observed this is probably not the problem here, but note there's also #5343 which can lead to a hanging target domain.
What was using all the memory?
As it turns out, it is the notification daemon:
During the signing:
Just before the freeze:
Notice that systemd-oomd doesn't seem to do its job.
Given memory consumption you observed this is probably not the problem here, but note there's also https://github.com/QubesOS/qubes-issues/issues/5343 which can lead to a hanging target domain.
I couldn't see any xenbus: xen store gave: unknown error E2BIG messages in journalctl -r.
I really hope https://github.com/QubesOS/qubes-issues/issues/889 will help (it replaces full notification daemon in each VM, with a lightweight proxy).
Hopefully!
Just as another data point: if the rebase finishes before a critical number of commits are signed, it seems to "un-freeze" after a short while. But there are still way too many processes running, although nothing is being signed:
Rebasing a large branch (signing 70+ commits)
Just had this happen to me. I decided to rebase without signing and then amend the commit to add a signature. This only works if you are squashing every commit of the rebase.
I really hope https://github.com/QubesOS/qubes-issues/issues/889 will help (it replaces full notification daemon in each VM, with a lightweight proxy).
Confirmed that it is the notifications. I don't have gnome-notification-daemon installed but dunst, and it also hangs after some commits. I changed the split-gpg2 configuration to verbose_notifications = no and it didn't hang anymore, so it is definitely the notification part that is buggy. It might even be notify-send and not the server, but that is just a guess.
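For anyone wanting to try the same setting, here is a minimal sketch of what the fragment might look like and how it parses. It assumes the split-gpg2 config is an INI-style file with the option in the [DEFAULT] section; the section name, the `SAMPLE` text and the helper `verbose_notifications_enabled` are illustrative assumptions, not taken from this thread.

```python
import configparser

# Assumed layout of the config fragment (section name is an assumption).
SAMPLE = """\
[DEFAULT]
verbose_notifications = no
"""


def verbose_notifications_enabled(text, section="DEFAULT"):
    """Return True/False for the option, or None if it is not set at all."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # getboolean() accepts yes/no, on/off, true/false and 1/0.
    return parser.getboolean(section, "verbose_notifications", fallback=None)


if __name__ == "__main__":
    print(verbose_notifications_enabled(SAMPLE))
```

Returning None when the option is absent avoids guessing what the real default is.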
I also ran into this again. I think it's not actually the notification daemon. It might be one of the processes that consumes the most memory per process, but at least for me its usage didn't increase much. This also matches the screenshots above and the missing OOMs.
What I observed is that the split-gpg2 server that is started per qubes.Gpg2 call never terminates. This leads to many python processes hanging around, each consuming a few percent of the memory.
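A rough way to tell "one big process" apart from "many small ones" is to group resident memory by process name. The sketch below reads /proc/<pid>/status on Linux; the helper names `parse_status` and `rss_by_name` are made up for illustration, and the process names you see will depend on your qube.

```python
import os
from collections import defaultdict


def parse_status(text):
    """Return (name, rss_kib) from the text of a /proc/<pid>/status file."""
    name, rss_kib = "?", 0
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key == "Name":
            name = value.strip()
        elif key == "VmRSS":
            rss_kib = int(value.split()[0])  # value looks like "  12345 kB"
    return name, rss_kib


def rss_by_name(proc_root="/proc"):
    """Aggregate process count and resident memory (KiB) per process name."""
    totals = defaultdict(lambda: [0, 0])  # name -> [count, rss_kib]
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                name, rss = parse_status(f.read())
        except OSError:
            continue  # process exited while we were scanning
        totals[name][0] += 1
        totals[name][1] += rss
    return totals


if __name__ == "__main__":
    rows = sorted(rss_by_name().items(), key=lambda kv: kv[1][1], reverse=True)
    for name, (count, rss) in rows[:10]:
        print(f"{name:<20} {count:>4} procs {rss / 1024:>8.1f} MiB")
```

If the leftover servers are the cause, the python entry should show a growing process count during the rebase while the notification daemon stays roughly flat.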
I tracked it down to wait_close hanging on the client_writer. Looks like a consequence of https://github.com/QubesOS/qubes-app-linux-split-gpg2/commit/f488ef10e42e39c22f7b5e95004b569f3acf5f1f and https://github.com/QubesOS/qubes-app-linux-split-gpg2/commit/2eb10acb15ecd8c05b301cbf4bdac6ba972a63e5.
Experimental fix: https://github.com/QubesOS/qubes-app-linux-split-gpg2/pull/24 (needs review by someone who understands asyncio internals. From skimming asyncio's code the change seems reasonable, and it works for me, but I'm not sure about unintended side effects).
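To illustrate the failure mode, not the change in the PR: if the wait in question is asyncio's StreamWriter.wait_closed() (an assumption on my side), a common defensive pattern is to bound it with a timeout so a stuck peer cannot keep the process alive forever. The helper name `close_writer` and the timeout value are hypothetical.

```python
import asyncio


async def close_writer(writer, timeout=5.0):
    """Close a StreamWriter and wait for it, giving up after `timeout` seconds.

    Returns True if the writer closed cleanly, False if the wait timed out.
    """
    writer.close()
    try:
        await asyncio.wait_for(writer.wait_closed(), timeout)
        return True
    except asyncio.TimeoutError:
        # Hypothetical fallback: drop the connection instead of waiting forever.
        # abort() discards any buffered data, so use it only on shutdown paths.
        writer.transport.abort()
        return False
```

The trade-off is that a timeout can cut off data that is still buffered, which is why a proper fix needs to understand why the close never completes in the first place.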
Not quite sure why verbose_notifications = no helped for @ben-grande. Maybe this just slightly reduced the memory consumption of the notification daemon, allowing a few more python processes to stay around?