Lmod load crashing in EasyBuild when building binutils
I think this will be in the block, but I'm not certain.
I cannot build foss toolchains that use binutils 2.42 because the binutils build fails. This is true for both GCC 12.3 and 13.3. Oddly, on two separate systems that have the same CPU but differ in RAM/GPU, one builds and the other does not. Specifically, the output of module load zlib appears to be truncated and the process dies with a SIGABRT.
I (temporarily) modified the EasyBuild code to dump out cmd.sh and env.sh so that I could gather more information, after strace wasn't particularly enlightening.
The error is: free(): invalid next size (normal), which triggers the SIGABRT.
You can see that it is truncating: out.txt
See cmd.sh and env.sh: cmd.sh.txt env.sh.txt
Running the above by hand also triggers the SIGABRT.
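For anyone repeating the by-hand run, a minimal sketch of how the exit status can be decoded to confirm that the process died from a signal rather than exiting with an ordinary error. The helper name describe_exit is hypothetical, and the commented invocation assumes env.sh only sets environment variables and cmd.sh holds the failing command; the exact way the two files were combined is not shown in this thread.

```shell
# Translate a shell exit status into a short description.
# By shell convention, a status above 128 means "killed by signal (status - 128)".
describe_exit() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        # kill -l <number> prints the signal name without the SIG prefix.
        printf 'killed by signal %s (SIG%s)\n' "$sig" "$(kill -l "$sig")"
    elif [ "$status" -eq 0 ]; then
        printf 'exited normally\n'
    else
        printf 'exited with status %s\n' "$status"
    fi
}

# Assumed usage with the dumped files (not verified against the originals):
#   ( . ./env.sh && sh ./cmd.sh ) > out.txt 2>&1; describe_exit "$?"

# Status 134 is what a shell reports for a child killed by SIGABRT (128 + 6).
describe_exit 134
```

A status of 134 together with truncated output is consistent with glibc aborting the process after detecting heap corruption, which matches the free() message above.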
The environment is Lmod with Tcl modulefiles on Rocky 8.10. If I comment out the _ModuleTable variables, the command runs, but then I've wiped the module table and it no longer gives the correct output. The base binutils 2.42 built against the system toolchain completed without issue.
Any suggestions for how I might fix this?
@Xaraxia That's an intriguing problem for sure!
Can you share some information on which version of Lmod you're using (module --version output), how it's configured (module --config output), and how Lmod was installed?
Which Lua and Tcl versions you have may also be relevant here.
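The requested details could be collected in one go with something like the sketch below. The report helper is hypothetical, and the rpm query assumes an RPM-based install with the package names Lmod, lua and tcl; adjust for other installation methods.

```shell
# Print a labelled section with the output of a command, or a note if the
# command is not available in the current shell.
report() {
    label=$1
    shift
    printf '== %s ==\n' "$label"
    if command -v "$1" >/dev/null 2>&1; then
        # Lmod writes --version and --config output to stderr, so merge streams.
        "$@" 2>&1 || printf '(exit status %s)\n' "$?"
    else
        printf '%s: not available in this shell\n' "$1"
    fi
}

report "Lmod version" module --version
report "Lmod configuration" module --config
report "Lua version" lua -v
report "Installed packages" rpm -q Lmod lua tcl
```

Since module is normally a shell function, this is best pasted into (or sourced from) an interactive shell where Lmod has already been initialised, rather than run as a standalone script.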
It seems to me that this is really a problem with your Lmod installation (or Lua, or Tcl, ...), but we can try to help figure it out. It may be useful to get some input from the Lmod maintainers on this (@rtmclay and @mrcawood)...
Modules based on Lua: Version 8.7.7 2022-07-05 10:00 -05:00 by Robert McLay [email protected]
Installed with standard Rocky 8.10 RPMs. Lmod-8.7.7-1.el8.x86_64
Lua 5.3.4 Copyright (C) 1994-2017 Lua.org, PUC-Rio
lua-5.3.4-12.el8.x86_64 tcl-8.6.8-2.el8.x86_64
The interesting thing is that apart from GPU support the images are the same - it builds on some and not others.
Also of interest: it successfully built in the past with EasyBuild 4.9.4. I'm in the process of doing a complete stack rebuild after upgrading to EasyBuild 5. That is partly because a number of easyconfigs have been modified in the last few years, which has occasionally caused issues where we mix old builds with new easyconfigs, and partly because we have upgraded Slurm and UCX and wanted a fresher OpenMPI build.
I have been overriding $MODULEPATH manually to force the new build, so I'm going to pull a node out of the cluster, edit the files, and try again. (Edit: that didn't help)
Updating to Lmod 8.7.55-1 did not help.
Please update to the latest version of Lmod (8.7.61). There have been recent changes to the Tcl interface. Try that first. If that doesn't work, then try setting LMOD_FAST_TCL_INTERP to no:
export LMOD_FAST_TCL_INTERP=no
before building binutils.
I grabbed the specfile from the RHEL/Fedora RPM and built Lmod 8.7.62. After installing that, binutils built. (The export listed above was not required.) Thank you @rtmclay for your efforts.
Should I assume that this means Tcl support for Lmod in EasyBuild 5 requires a newer Lmod version than is currently specified? If so, that's an update that needs to be made to the EasyBuild documentation.
I'll update our HPC image with the new Lmod build. For me, this is what I needed, thank you.
@Xaraxia It seems like you were hitting a bug in Lmod.
EasyBuild only requires a particular version of Lmod because we rely on certain features of it, not because we know it's 100% bug free.
Unless @rtmclay can point out a particular Lmod version in which this problem was fixed, I don't think there's much we can do (except for documenting the problem, as we're doing here).
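For sites that want to guard against this until a fixed version is identified, a rough sketch of a version check follows. The threshold 8.7.62 is only the version reported as working in this thread (8.7.55 was reported as still failing); it is an assumption, not a confirmed fix version. The helper names parse_lmod_version and version_ge are hypothetical, and the comparison relies on GNU sort -V.

```shell
# Extract the version number from `module --version` output.
parse_lmod_version() {
    printf '%s\n' "$1" | sed -n 's/.*Version \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Succeed if version $1 is greater than or equal to version $2.
# Uses GNU sort -V so that 8.7.7 sorts before 8.7.55.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example input taken from the version banner quoted earlier in this thread.
# On a live system this could instead be: banner=$(module --version 2>&1)
banner='Modules based on Lua: Version 8.7.7 2022-07-05 10:00 -05:00'
current=$(parse_lmod_version "$banner")

# 8.7.62 is an assumed threshold (see the note above), not an official one.
if version_ge "$current" 8.7.62; then
    printf 'Lmod %s is at or above the version reported as working\n' "$current"
else
    printf 'Lmod %s is older than the version reported as working (8.7.62)\n' "$current"
fi
```

A plain string comparison would rank 8.7.7 above 8.7.55, which is why the sketch uses a version-aware sort.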