samurai
samurai hangs when compiling LLVM 17.0.5
G'day, samurai contributors.
Recently, when working at Copacabana's build system, I perceived that the LLVM build would hang specifically betwixt object 1004 (tools/clang/lib/Sema/CMakeFiles/obj.clangSema.dir/SemaDeclObjC.cpp.o) and 1012 (tools/clang/lib/Sema/CMakeFiles/obj.clangSema.dir/SemaDeclObjC.cpp.o) when building with samu (as a link to ninja) built from the master branch --- I know, I was asking for instability, but I wanted to see what does samurai needs for a new release --- instead of ninja version 1.12.1.
I do not have an exact response about why is this happening, though I could make a safe bet about it being because g++ may get an O.O.M. kill and samu doesn't get it, so it restarts the process and it continues running, getting killed, then running again, getting killed... And that's it.
I'm building Copacabana from Void Linux musl x86_64 using its build system, and the last time I updated it was on October 29th.
Much obliged for your attention, and I would like to know if there's anything I could help with on this project.
Not the maintainer but trying to help too; I don't think it's about signals. Did you actually see an OOM in the kernel logs? I tried killing a long-running compilation process with `kill -9` and Samurai (master branch) correctly reacted with:
```
samu: job failed with status 137: gcc [etc.]
samu: subcommand failed
```
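If you want to repeat that test against your LLVM build, something along these lines should do it (the pgrep target is just a guess at which compiler process will be running):

```sh
# Rough sketch: kill one running compile job and check that samu notices the
# failure instead of hanging. The process picked by pgrep is only a guess.
samu -C build &
sleep 30                          # let a few compile jobs start
kill -9 "$(pgrep -nx cc1plus)"    # kill the newest g++ backend process
wait                              # samu should report the failed job and exit nonzero
```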
Also, with respect to making a new release, can you check if this is a regression from the latest release (commit da43f8c7006a0d76048c110594dfc86a9f8c50de)? If it is, we can try bisecting it.
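For reference, the bisect itself would look roughly like this, assuming the hang reproduces reliably and that a plain `make` is enough to build samu at each step:

```sh
git clone https://github.com/michaelforney/samurai && cd samurai
# master is the "bad" end of the range, the last release commit is the "good" end
git bisect start master da43f8c7006a0d76048c110594dfc86a9f8c50de
make                 # build samu at the revision git checked out
# ...re-run the LLVM build with this ./samu, then tell git the result:
#   git bisect good      # the build completed
#   git bisect bad       # the build hung
# and repeat until git names the first bad commit
```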
> I don't have an exact explanation for why this is happening, though I could make a safe bet that it's because g++ gets an OOM kill and `samu` doesn't notice it, so it restarts the process, which continues running, gets killed, runs again, gets killed... and that's it.
This can't be it, because samurai doesn't restart jobs. If a process is killed and the number of failures is then above the failure limit (the -k flag), samurai will wait for any running jobs to complete and then exit with a failure. Otherwise, it will keep going with jobs that don't depend on the killed job.
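To illustrate with a toy manifest (completely unrelated to the LLVM build), the independent job still gets built even though the other one fails:

```sh
cat > build.ninja <<'EOF'
rule fail
  command = false
rule ok
  command = touch $out
build a: fail
build b: ok
EOF
# 'a' fails right away; with -k 0 samu keeps going, still builds 'b',
# and then exits with a failure status. It never re-runs 'a'.
samu -k 0
```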
As @bonzini said, can you see if you can reproduce with 1.2? Also, are you able to reproduce this consistently?
When the problem occurs, here's some other helpful info you could gather:
- The process tree when it is hung
- gdb backtrace of samu when it is hung
- strace output of other processes under samu in the process tree (attach with `strace -p $PID`). If it is not samu but one of the jobs that is hanging, this might help identify why.
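Concretely, something along these lines (finding the pid with pgrep is just one convenient way):

```sh
pid=$(pgrep -o -x samu)                          # oldest process named exactly "samu"
pstree -p "$pid"                                 # process tree under samu
gdb -p "$pid" -batch -ex 'thread apply all bt'   # backtrace (may need ptrace permissions)
strace -f -p "$pid" -o samu.strace               # trace samu and its children
```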
If you are able to reproduce this consistently, do you think you could come up with some self-contained way of reproducing, like a VM image and commands to run, or a builds.sr.ht script?
I suspect that it is simply a job that is hanging, but I don't have a guess as to why. Usually any build failures compared to ninja are due to the build manifest insufficiently describing dependencies between jobs and it working with ninja by accident. However, this usually results in an immediate build failure rather than a hang.
G'day, people. Pardon for the delay.
> Not the maintainer but trying to help too; I don't think it's about signals. Did you actually see an OOM in the kernel logs?
@bonzini Well, I hadn't checked for these at the time.
> Also, with respect to making a new release, can you check if this is a regression from the latest release (commit https://github.com/michaelforney/samurai/commit/da43f8c7006a0d76048c110594dfc86a9f8c50de)?
I had just downloaded GitHub's autogenerated master.tar.gz tarball and built from it. Nothing in particular that I can remember.
> As @bonzini said, can you see if you can reproduce with 1.2? Also, are you able to reproduce this consistently?
@michaelforney If I remember correctly, I could reproduce it consistently, more than three times in a row; it would hang, if not on object 1004, then on object 1012 (as I cited above). That's not an exact measure, of course.
> When the problem occurs, here's some other helpful info you could gather:
> - The process tree when it is hung
> - gdb backtrace of samu when it is hung
> - strace output of other processes under samu in the process tree (attach with `strace -p $PID`). If it is not samu but one of the jobs that is hanging, this might help identify why.
I wish I had known this before... 😅 Well, in that exact configuration it is extremely unlikely, since the virtual machine decided to leave this world by itself after an update.
> If you are able to reproduce this consistently, do you think you could come up with some self-contained way of reproducing, like a VM image and commands to run, or a builds.sr.ht script?
I don't know anything about builds.sr.ht --- I know about DeVault's sourcehut, but nothing beyond that ---, but a VM image would be somewhat impractical. Even so, since building LLVM itself is part of a script (as I also said, it is part of Copacabana's build system), I think I can reproduce it in another scenario (running it natively on my machine instead of in a VM). I've been busy lately, so I don't think I can get to it very soon (possibly not until the end of the year).
> I suspect that it is simply a job that is hanging, but I don't have a guess as to why.
If that were the case, it wouldn't continue normally ~~until erring later on~~ with the vanilla, default `ninja`.
> G'day, people. Pardon for the delay.
No worries, thanks for the update :)
> I had just downloaded GitHub's autogenerated `master.tar.gz` tarball and built from it. Nothing in particular that I can remember.
>
> > As @bonzini said, can you see if you can reproduce with 1.2? Also, are you able to reproduce this consistently?
>
> @michaelforney If I remember correctly, I could reproduce it consistently, more than three times in a row; it would hang, if not on object 1004, then on object 1012 (as I cited above). That's not an exact measure, of course.
Ok. When you have the time, if you could try to see if you get the same issue with 1.2 that'd be helpful.
> > I suspect that it is simply a job that is hanging, but I don't have a guess as to why.
>
> If that were the case, it wouldn't continue normally ~~until erring later on~~ with the vanilla, default `ninja`.
Not necessarily. It's conceivable that a job only hangs when it is run in a certain order (perhaps before an unspecified dependency not listed in the manifest), and with typical job timings and job parallelism it's unlikely to show up with ninja.
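As a contrived illustration of that kind of manifest bug (a toy example; the symptom here is a failure rather than a hang, but the order dependence is the same idea):

```sh
cat > build.ninja <<'EOF'
rule gen
  command = echo '#define X 1' > $out
rule use
  command = grep -q X gen.h && touch $out
build gen.h: gen
build use.txt: use
EOF
# use.txt really depends on gen.h but never declares it, so the result depends
# entirely on which of the two jobs the scheduler happens to run first.
samu -j2
```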
I will try to build LLVM on a fresh Debian image and see if I can reproduce it myself.
> I will try to build LLVM on a fresh Debian image and see if I can reproduce it myself.
That's the version I'm using: https://github.com/llvm/llvm-project/releases/download/llvmorg-17.0.5/llvm-project-17.0.5.src.tar.xz And this is how it is being configured (just change "Unix Makefiles" to "Ninja"): https://github.com/Projeto-Pindorama/alambiko/blob/6b1deeac25251cf0f668a724c55d99315aafa97b/dev/LLVM/pkgbuild.ksh#L152-L159
Unfortunately, I'm having trouble reproducing this. Here's what I tried:
- Using a fresh Debian sid image, I downloaded that llvm tarball and configured it with `cmake -G Ninja -B build -S llvm -Wno-dev -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_RUNTIMES='compiler-rt;libunwind;libcxx;libcxxabi' -DLLVM_ENABLE_PROJECTS='clang;lld'`
- Then, I built with `samu -C build` using the samurai master branch.
The build succeeded without issue, and notably got past tools/clang/lib/Sema/CMakeFiles/obj.clangSema.dir/SemaDeclObjC.cpp.o. Since you mentioned building from Void musl, I also tried the same thing on my own Linux system (oasis), which is musl-based; that also succeeded.
Are there some other flags I should pass to cmake? The script you referenced has some flags to set C/CXX/LD flags, as well as a bunch of other flags in environment variables ($CT $CTG $CP $CRT $CLG $CLCPP $CLCPPA $CUW $CLLVM $COFF). I omitted those, but maybe they are important to reproduce the issue. Could you provide a single cmake command-line with environment variables expanded to try?
I'm not sure exactly how to run the copacabana build script myself. I looked at the website, but the link to the docs seems to be dead: http://tabula.pindorama.net.br/copacabana
After today (I still have some work to do at the uni and there are also preparations for Christmas) I will check on this. Pardon for the delay; I couldn't open my email in the last few days.
EDIT:
> I'm not sure exactly how to run the copacabana build script myself. I looked at the website, but the link to the docs seems to be dead: http://tabula.pindorama.net.br/copacabana
We haven't put that back up yet; the build system has changed since 2021 (it became more manual), but I can give you a quick tip on how to run it: check `build-system/machine.ini`, edit the settings to your needs, then just run `build.ksh`.
It should work, but don't worry about it for now, because I will check on it as soon as possible.
Finally, free. Let me see...
> Are there some other flags I should pass to cmake? The script you referenced has some flags to set C/CXX/LD flags, as well as a bunch of other flags in environment variables ($CT $CTG $CP $CRT $CLG $CLCPP $CLCPPA $CUW $CLLVM $COFF). I omitted those, but maybe they are important to reproduce the issue. Could you provide a single cmake command-line with environment variables expanded to try?
Well, it varies from case to case depending on the build file. I could do it, but I just realized it also depends on a previously built toolchain... I will have to do it myself here.
I know you guys already said to use `strace` and check with `gdb` --- which I think I can't do on the first run, but on a second one ---; is there anything else I would need to do to get a full report?
> Finally, free. Let me see...
>
> > Are there some other flags I should pass to cmake? The script you referenced has some flags to set C/CXX/LD flags, as well as a bunch of other flags in environment variables ($CT $CTG $CP $CRT $CLG $CLCPP $CLCPPA $CUW $CLLVM $COFF). I omitted those, but maybe they are important to reproduce the issue. Could you provide a single cmake command-line with environment variables expanded to try?
>
> Well, it varies from case to case depending on the build file. I could do it, but I just realized it also depends on a previously built toolchain... I will have to do it myself here. I know you guys already said to use `strace` and check with `gdb` --- which I think I can't do on the first run, but on a second one ---; is there anything else I would need to do to get a full report?
A process tree (pstree) would also be helpful to indicate whether or not any subprocesses of samurai are still running.
Also, I forgot to ask before, when it is hung and you kill it, does it hang again when you start it again?
Thanks for your help, and no worries about the delay; whenever you have the time is fine.
> Also, I forgot to ask before, when it is hung and you kill it, does it hang again when you start it again?
Yes, but in a different "place" of the build (as I said before).
A small question that I forgot to ask before: is it a problem if I `pstree` my machine while other programs are running? I will try both on it and on a VM, but I'm usually running other programs while compiling.
> Yes, but in a different "place" of the build (as I said before).
The object number isn't relevant; there's no correlation between the object number and build command.
It seems it was hung at the same place in both cases, when building tools/clang/lib/Sema/CMakeFiles/obj.clangSema.dir/SemaDeclObjC.cpp.o. I'm interested in whether there is a subprocess for that compile command running when it is hung.
> A small question that I forgot to ask before: is it a problem if I `pstree` my machine while other programs are running? I will try both on it and on a VM, but I'm usually running other programs while compiling.

You can restrict the output to a particular process. If you can find the pid of samu and then show the output of `pstree -c $PID`, that'd be helpful.
Also, if you run `samu -v`, it will show the full commands it is running. If it is hanging during one, you could try killing samurai and then running the command yourself to see if it finishes.
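For example (the log file name is arbitrary):

```sh
# Log every command samu starts; after killing the hung samu, the command to
# retry by hand will be near the end of the log (with parallel jobs it may not
# be the very last line).
samu -C build -v 2>&1 | tee samu-verbose.log
tail -n 5 samu-verbose.log
```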
> The object number isn't relevant; there's no correlation between the object number and build command.
>
> It seems it was hung at the same place in both cases, when building `tools/clang/lib/Sema/CMakeFiles/obj.clangSema.dir/SemaDeclObjC.cpp.o`. I'm interested in whether there is a subprocess for that compile command running when it is hung.
That's really useful to know.
> You can restrict the output to a particular process. If you can find the pid of samu and then show the output of `pstree -c $PID`, that'd be helpful.
>
> Also, if you run `samu -v`, it will show the full commands it is running. If it is hanging during one, you could try killing samurai and then running the command yourself to see if it finishes.
Wonderful, I will try it and then send the reports here. It will possibly take a while because, well, we need to build a toolchain-specific GCC before trying to build LLVM.
First of all, Merry Christmas to you guys.
I think I've found the cause of this problem. In short, it is caused by the "native" ninja-build binary from the Ninja project conflicting with samu in CMake, which makes it hang for some reason; I couldn't reproduce that here because, well, I only have samurai installed.
In any case, here's my log:
strace_samu_CopaBuild.txt
It erred at the end, but that's my fault in LLVM's configuration, not yours.
Maybe a fix would be making samurai conflict with ninja in package managers' manifest files and/or linking samu to both ninja and ninja-build.
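For the second option, something like this in the package's install step could work (the paths are just an example):

```sh
# Hypothetical packaging step: expose samu under both names CMake's Ninja
# generator may look for, so a stray ninja-build binary can't be picked instead.
ln -sf /usr/bin/samu /usr/local/bin/ninja
ln -sf /usr/bin/samu /usr/local/bin/ninja-build
```

Alternatively, passing `-DCMAKE_MAKE_PROGRAM=$(command -v samu)` to cmake should sidestep the lookup entirely.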
Since I've found the cause, I will mark this as closed.
I'm going to reopen this, since I'd still like to get to the bottom of this. It shouldn't hang even if you also have ninja installed.
I will try to reproduce with both ninja and samurai installed.
G'day, Michael --- at least it's already early morning/late, late night here in Brazil.
Right, there are a couple of things I need to tell you before you keep experimenting with it. First of all, I was running it on Void Linux musl, and ninja was installed from its official repositories; if I'm not mistaken, I've already said this, so I will try not to be so prolix this time. Then, I built LLVM using Copacabana's new build system, but you don't need to use it to reproduce this completely; just redo some of its steps:
- First, mussel is compiled; it will be used as the C compiler to build LLVM later on, and it is installed at `/cgnutools` (a.k.a. the "prefix");
- Secondly, the Linux kernel headers (same place as mussel);
- Thirdly, we build the musl C library, zlib and libatomic --- which is at Projeto-Pindorama/libreatomic, but it's just a fork, without any modifications (for now), of Chimera Linux's ---; these three are installed at `/llvmtools`;
- Fourthly, and lastly, we go and compile libunwind and LLVM for `/cgnutools` again.

You don't need to read Copacabana's pkgbuilds for these exactly --- I admit they're pretty big and that causes some confusion ---; you can read Derrick's (a.k.a. dslm4515) docs on building LFS with Clang and musl (a.k.a. CMLFS), which is what Copacabana is partially based on. There are, of course, many modifications and differences between the two, but they're negligible at this very first stage. If you need anything and you think I can help, call me and I will try to clarify things further.
I wish you the best of luck and, of course, much obliged for this project.
And a (late) Happy New Year, of course.
@michaelforney I will build Copacabana again, and I've decided that this will be the perfect moment to "retest" this bug. I will keep you informed and send my logs here. This time I'm running on Mageia with the GNU C Library, not on Void; I will also have a clean installation of samurai, without ninja alongside.