Work around pacman hangs on Windows/ARM64
Both in the MSYS2 project as well as in the Git for Windows project, automated builds on Windows/ARM64 are plagued by semi-randomly hanging pacman processes.
@jeremyd2019 did a great job diagnosing these, and I also started to dig into this. After many, many months (with many, many breaks), I found this here work-around.
Here is a successful run of Git for Windows' sync job that tries to update, commit & push the git-sdk-arm64 repository. Previously it consistently ran into those hangs, and replacing the pacman.exe with the version built in this here PR worked around those hangs.
Technical description
A common symptom is that the hanging process has a command-line that is identical to its parent process' command-line (indicating that it has been fork()ed), and anecdotally, the hang occurs when _exit() calls proc_terminate() which is then blocked by a call to TerminateThread() with an invalid thread handle (for more details, see https://github.com/msys2/msys2-autobuild/issues/62#issuecomment-1951796327).
In my tests, I found that the hanging process is spawned from _gpgme_io_spawn() which lets the child process immediately spawn another child. That seems like a fantastic way to find timing-related bugs in the MSYS2/Cygwin runtime.
As a work-around, it does seem to help if we avoid that double-fork.
This partially reverts 61aa1947 (... Use a double-fork approach..., 2002-08-28).
In private communication the GIMP team was told that pacman hanging on arm may be solved in Windows OS build 26080.1. It is of course to be seen when that particular fix will be available in stable Windows releases.
In private communication the GIMP team was told that pacman hanging on arm may be solved in Windows OS build 26080.1. It is of course to be seen when that particular fix will be available in stable Windows releases.
Thank you for sharing this! And yes, I'd like to have the work-around in hand earlier than that.
Let me know if you have any ideas on how I can help test this. I will think on it a bit, maybe just stress-test pacman on Win10 on Raspberry PI and Win11 on VM on dev kit, as I did last time.
Let me know if you have any ideas on how I can help test this. I will think on it a bit, maybe just stress-test pacman on Win10 on Raspberry PI and Win11 on VM on dev kit, as I did last time.
Yes, stress-testing would be good, after installing the package from the PR build. Thank you for all your help @jeremyd2019!
I just did some testing as well. Created this PR with the package from this PR build installed.
- I ran the
syncworkflow 17 times. We went from a (close to) 0% success rate before applying this patch, to a 100% success rate after applying it. - Did a bunch of
pacmaninstalls locally on my Surface Pro X, didn't experience hangs anymore. - Created an Azure ARM64 VM in my own Azure account and tested there as well. Also, no more hangs!
This looks very promising. Thanks a lot for diving in and providing the patch, @dscho! 🎉
@dennisameling thank you so much for verifying that the work-around does what we'd hoped it would do!
I built this PR for i686 msys2, and have been running it on raspberry pi 4/windows 10 for at least 24 hours. It has not hung yet, which is better than it had done before this change.
I extracted the 'double-fork' logic from gpgme and am going to test with that next to see if I can reproduce a hang in something simpler than pacman-statically-linked-to-gpgme as a reproducer. (see https://github.com/msys2/msys2-autobuild/issues/62#issuecomment-2099461671)
Note that one of the incarnations of this was apparently in pacman calling an install-info hook, which should have nothing to do with gpgme, so I'm not sure this workaround would catch every possible hang.
Note that one of the incarnations of this was apparently in pacman calling an install-info hook, which should have nothing to do with gpgme, so I'm not sure this workaround would catch every possible hang.
Right! It would only be a band-aid for the common case.
FWIW I also experienced hangs when code-signing with osslsigncode, which would also not be covered.
So this here PR would really only be helping with some scenarios, but since I expect the real fix to still be a ways off, it would be better to have this work-around than not to have it.
Unless.
Unless that double-fork has a deeper purpose that escapes me (maybe it tries to prevent some attack vector? The commit message is quite mum about the intention of that change.
BTW I've edited the PR description to make the information I hid in that .patch message more discoverable.
A couple of questions I want to clear up:
- will this change result in zombie processes? I think on a real posix system, it would result in zombies for the lifetime of the process which invokes gpgme functions. I don't know if Cygwin replicates this particular behavior of posix though. I will try to test this and see.
- what else might be affected by the change to gpgme? Most seem to be pacman-related, but wget was unexpected:
'wget', 'mutt', 'libgpgme-python', 'pacutils', 'pacman-contrib', 'python-pyalpm', 'pacman'
It looks like wget only uses gpgme for metalink, and both end up disabled in the build in the msys2 repo because --without-metalink is specified to configure. It could therefore be removed from depends/makedepends of wget.
- will this change result in zombie processes?
Those zombie processes are caused by the way Unix/Linux works: To be able to read the exit code of the exited process, there is an entry in the process table that is kept until someone calls waitpid() with the corresponding pid.
That's not how the Windows kernel works, so I imagine Cygwin emulates this by keeping a separate process running (which might be the very same hanging process that we're seeing). It's not technically a zombie process, though.
@jeremyd2019 I will try to find time later this week to see whether this change does leave unwanted "eternally-running" processes behind, unless you beat me to it.
I haven't had time to check what happens yet, will probably not get time until tomorrow (say, 20ish hours).
What I kind of expect to see is that Cygwin's shared memory process table will have a 'zombie' entry in it (in order to properly emulate posix behavior wrt the wait family of functions), but that the Windows process exits as expected. I do not know whether Cygwin would keep a Windows HANDLE open to the exited process (to call GetExitCodeProcess when somebody calls wait) or if they take care of that in a wait thread somewhere and thus do not leave 'zombie' Windows processes around (if the object still has outstanding HANDLE references it cannot be freed in the kernel). I thought about asking on [email protected] about what happens in this case, but it would be easy enough to test and find out.
I did a test this evening on x86_64, and I'm not seeing any evidence of zombie processes in ps, /proc/*, or in task manager when removing inner fork and waitpid calls from my test program as you did in the patch on this PR. I'm sure cygwin must keep track of the exit code of the processes somehow because it cannot know nobody will ever call wait, but it doesn't seem to expose "zombie" entries in ps.
Well, some "zombie" entries are in /proc/sys/BaseNamedObjects/msys-2.0*/cygpid.*. Also the entries in chld_procs. They look like they have valid hProcess members, so there are probably kernel process objects (EPROCESS? been a while since I've poked these things in kernel mode) still being kept around too
I wonder whether GPGME should call waitpid() somewhere, which is the recommended way to get rid of zombie processes as far as I can tell...
In private communication the GIMP team was told that pacman hanging on arm may be solved in Windows OS build 26080.1
The hangs are still happening to me on 26120.670... maybe that fixed a different hang?
The hangs are still happening to me on 26120.670... maybe that fixed a different hang?
It was said that an MSYS2 issue was resolved, since I was/am not aware of any other issues, I assumed it must be the pacman hang. I'll ask back for clarification.
Only slightly related, #4605 reduces the amount of gpgme spawns a bit, for example when starting pacman from 27 down to 19
$ GPGME_DEBUG=7 pacman |& grep "_gpgme_io_spawn: leave" | wc -l
19
And one less spawn per package install, from 3 down to 2.
I received a clarification about what was fixed, so indeed this was not the pacman hang:
Below was issue seen (msys2 command hang), that was resolved in build 26080.
1. Download https://repo.msys2.org/distrib/x86_64/msys2-base-x86_64-20221028.tar.xz
2. Extract the archive
3. "cd" to the extracted archive in a clean CMD window
4. Execute "msys2_shell.cmd -defterm -no-start -clang64 -c exit"
5. Program hangs, with bash taking up 10% cpu
In case anyone is waiting for me for some reason, my view is that this is not a viable workaround to merge here because:
- it will result in leaked cygpids ("zombie" processes) for the lifetime of the process calling gpgme.
- this will happen for all consumers of gpgme, not just pacman (though there aren't many non-pacman-related consumers of gpgme in MSYS2-packages)
- this will happen for all x86_64 systems, which are not experiencing the issue, to work around an issue on arm64 systems, which are much fewer.
I am hopeful that someone will help try to figure out what's going on. I am guessing Corinna is on vacation or otherwise offline. If nothing else, maybe I can try to hack around on the wait thread terminate call some more and see what happens.
Was this issue ever resolved? This issue affects a large number of systems used in production today
No. I still have not gotten any reply on cygwin list on this issue, and I've not had any luck so far trying to figure something out on my own. Latest post (after a comment that the thread was previously incorrectly sent to cygwin-developers): https://cygwin.com/pipermail/cygwin/2024-July/256271.html
No. I still have not gotten any reply on cygwin list on this issue, and I've not had any luck so far trying to figure something out on my own. Latest post (after a comment that the thread was previously incorrectly sent to cygwin-developers): https://cygwin.com/pipermail/cygwin/2024-July/256271.html
The status update is much appreciated, thanks! Let's hope that this gets fixed on arm soon
Hopefully this is now taken care of in msys2/msys2-runtime#234
Hopefully this is now taken care of in msys2/msys2-runtime#234
Yes! I tested this and it worked!
Thank you so much @jeremyd2019 for your persistence, time and patience!