
split-gpg2 VM hangs on signing too many commits in a row

Open Rot127 opened this issue 10 months ago • 6 comments

Qubes OS release

Qubes OS 4.2

Brief summary

Rebasing a large branch (signing 70+ commits) can make the split-gpg2 VM freeze up, possibly because all the swap space fills up after a while (see screenshot below).

Once all swap space is full, the VM hangs for a while. After a few minutes, some "gpg access granted" notifications show up, long delayed.

The split-gpg2 VM has to be restarted/killed to continue committing.

Steps to reproduce

  1. Rebase a branch with many commits (signing required, of course); a scripted stand-in is sketched below.
  2. Observe that ~~split-gpg2~~ notification-daemon uses up all the memory/swap without cleaning up.
  3. The VM hangs.
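
For reference, a minimal scripted reproduction of step 1, assuming `git` is already configured to sign through split-gpg2 (`user.signingkey` set and gpg pointed at the split-gpg2 client). The loop below stands in for the rebase and simply triggers one `qubes.Gpg2` call per commit:

```python
# Sketch: create ~80 signed commits in a row, one qubes.Gpg2 call each.
# Assumes git is already set up to sign via split-gpg2.
import os
import subprocess
import tempfile

def git(*args, cwd):
    subprocess.run(["git", *args], cwd=cwd, check=True)

repo = tempfile.mkdtemp(prefix="splitgpg2-repro-")
git("init", cwd=repo)
for i in range(80):
    with open(os.path.join(repo, "file.txt"), "w") as f:
        f.write(f"change {i}\n")
    git("add", "file.txt", cwd=repo)
    git("commit", "-S", "-m", f"commit {i}", cwd=repo)  # -S signs the commit
```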

Expected behavior

Split-gpg2 is able to sign all commits, even with only 1 GB of swap + 500 MB of memory. The notification-daemon doesn't consume all the memory.

Actual behavior

The VM freezes.

htop at the time of freeze:

[screenshot]

Additional information

No response

Rot127 · Feb 16 '25 13:02

What was using all the memory?

Given the memory consumption you observed, this is probably not the problem here, but note there's also #5343, which can lead to a hanging target domain.

HW42 · Feb 19 '25 09:02

> What was using all the memory?

As it turns out, it is the notification daemon:

During the signing:

[screenshot]

Just before the freeze:

[screenshot]

Notice that `systemd-oomd` doesn't seem to do its job.

> Given the memory consumption you observed, this is probably not the problem here, but note there's also https://github.com/QubesOS/qubes-issues/issues/5343 which can lead to a hanging target domain.

I couldn't see any `xenbus: xen store gave: unknown error E2BIG` messages in `journalctl -r`.
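
For anyone else checking for the #5343 symptom, a quick scan of the current boot's journal for that exact message (the log string is the one quoted above; `journalctl -b` just limits the scan to this boot):

```python
# Quick check for the xenbus E2BIG messages associated with #5343.
import subprocess

out = subprocess.run(
    ["journalctl", "-b", "--no-pager"],
    capture_output=True, text=True,
).stdout
hits = [line for line in out.splitlines() if "unknown error E2BIG" in line]
print(f"{len(hits)} matching xenbus messages")
```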

Rot127 · Feb 20 '25 15:02

I really hope https://github.com/QubesOS/qubes-issues/issues/889 will help (it replaces the full notification daemon in each VM with a lightweight proxy).
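
For illustration only, the proxy idea could look roughly like the stub below: a minimal `org.freedesktop.Notifications` service that forwards notifications out of the VM instead of rendering them locally. This is a sketch of the concept, not the actual #889 implementation; the `dbus-next` dependency and the `forward_to_dom0` hook are assumptions.

```python
# Sketch: a stub org.freedesktop.Notifications service that forwards
# notifications instead of running a full daemon. NOT the actual #889 code.
import asyncio
from dbus_next.aio import MessageBus
from dbus_next.service import ServiceInterface, method


def forward_to_dom0(summary: str, body: str) -> None:
    # Hypothetical forwarding hook; a real proxy would hand this to a
    # qrexec service in dom0. Here we just log it.
    print(f"notify: {summary}: {body}")


class NotificationProxy(ServiceInterface):
    def __init__(self):
        super().__init__('org.freedesktop.Notifications')
        self._next_id = 1

    @method()
    def Notify(self, app_name: 's', replaces_id: 'u', app_icon: 's',
               summary: 's', body: 's', actions: 'as', hints: 'a{sv}',
               expire_timeout: 'i') -> 'u':
        forward_to_dom0(summary, body)
        nid, self._next_id = self._next_id, self._next_id + 1
        return nid

    @method()
    def GetServerInformation(self) -> 'ssss':
        return ['proxy-sketch', 'example', '0.1', '1.2']


async def main():
    bus = await MessageBus().connect()
    bus.export('/org/freedesktop/Notifications', NotificationProxy())
    # Fails if a full daemon (e.g. dunst) already owns the name.
    await bus.request_name('org.freedesktop.Notifications')
    await asyncio.get_running_loop().create_future()  # serve forever

asyncio.run(main())
```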

marmarek · Feb 20 '25 15:02

Hopefully!

Just as another data point: if the rebase finishes before a critical number of commits are signed, the VM seems to "un-freeze" after a short while. But there are still way too many processes running, even though nothing is being signed:

[screenshot]

Rot127 · Feb 20 '25 15:02

> Rebasing a large branch (signing 70+ commits)

Just had this happen to me. I decided to rebase without signing and then amend a signature onto the commit; this only works if you are squashing every commit of the rebase.

> I really hope https://github.com/QubesOS/qubes-issues/issues/889 will help (it replaces the full notification daemon in each VM with a lightweight proxy).

Confirmed that it is the notifications. I don't have gnome-notification-daemon installed but dunst, and it also hangs after some commits. I changed the split-gpg2 configuration to `verbose_notifications = no` and it didn't hang anymore, so the notification path is definitely where the bug is; it might even be notify-send rather than the server, but that's just a guess.

ben-grande · May 13 '25 18:05

I also ran into this again. I think it's not actually the notification daemon. That might be the single process consuming the most memory, but at least for me its usage didn't grow much. This also matches the screenshots above and the absence of OOM kills.

What I observed is that the split-gpg2 server that is started per `qubes.Gpg2` call never terminates. This leaves many Python processes hanging around, each consuming a few percent of the memory; a rough way to see the pile-up is sketched below.
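
To make the pile-up visible, one can count lingering server processes by scanning `/proc`; the cmdline substrings matched below are assumptions, so adjust them to however the server shows up on your system:

```python
# Rough check: count lingering split-gpg2 server processes via /proc.
# The substrings matched against cmdline are assumptions.
import os

count = 0
for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            cmdline = f.read().replace(b"\0", b" ")
    except OSError:
        continue  # process exited while we were scanning
    if b"split-gpg2" in cmdline or b"splitgpg2" in cmdline:
        count += 1
print(f"{count} split-gpg2 server processes running")
```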

I tracked it down to `wait_close` hanging on the `client_writer`. Looks like a consequence of https://github.com/QubesOS/qubes-app-linux-split-gpg2/commit/f488ef10e42e39c22f7b5e95004b569f3acf5f1f and https://github.com/QubesOS/qubes-app-linux-split-gpg2/commit/2eb10acb15ecd8c05b301cbf4bdac6ba972a63e5.
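
For readers not steeped in asyncio: `StreamWriter.wait_closed()` only returns once the underlying transport is closed, so awaiting it on a writer that is never `close()`d (and whose peer keeps the connection open) blocks forever. A standalone illustration of that hazard, not the split-gpg2 code itself:

```python
# Standalone illustration: wait_closed() blocks until the transport closes.
import asyncio

async def main():
    # A dummy server whose handler keeps the connection open.
    server = await asyncio.start_server(lambda r, w: None, "127.0.0.1", 0)
    host, port = server.sockets[0].getsockname()
    _reader, writer = await asyncio.open_connection(host, port)

    try:
        # Nothing ever closes the connection, so this would hang forever;
        # the timeout is only here to demonstrate that.
        await asyncio.wait_for(writer.wait_closed(), timeout=1)
    except asyncio.TimeoutError:
        print("wait_closed() hung: the writer was never closed")

    writer.close()
    await writer.wait_closed()  # returns promptly once close() was called
    server.close()
    await server.wait_closed()

asyncio.run(main())
```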

Experimental fix: https://github.com/QubesOS/qubes-app-linux-split-gpg2/pull/24 (needs review by someone who understands asyncio internals; skimming asyncio's code, the change seems reasonable, and it works for me, but I'm not sure about unintended side effects).
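
The general defensive pattern for this class of bug looks roughly like the helper below: bound the `wait_closed()` with a timeout and fall back to aborting the transport. A generic sketch, explicitly not the change in PR #24:

```python
# Generic defensive pattern (NOT the actual PR #24 change): never let a
# stuck wait_closed() keep the server process alive indefinitely.
import asyncio

async def close_writer(writer: asyncio.StreamWriter, timeout: float = 5.0):
    writer.close()
    try:
        await asyncio.wait_for(writer.wait_closed(), timeout)
    except asyncio.TimeoutError:
        # Force the transport down, dropping any buffered data.
        writer.transport.abort()
```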

Not quite sure why `verbose_notifications = no` helped for @ben-grande. Maybe it just slightly reduced the memory consumption of the notification daemon, allowing a few more Python processes to stay around?

HW42 · May 14 '25 12:05