
Lmod load crashing in EasyBuild when building binutils

Open Xaraxia opened this issue 7 months ago • 6 comments

I think the problem is in the easyblock, but I'm not certain.

I cannot build foss toolchains that use binutils 2.42 because the binutils build fails. This happens with both GCC 12.3 and 13.3. Oddly, of two systems with the same CPU but different RAM and GPU, one builds and the other does not. Specifically, the output of module load zlib appears to be truncated, and the process then aborts with SIGABRT.

I (temporarily) modified the EasyBuild code to dump out cmd.sh and env.sh so that I could gather more information, after strace wasn't particularly enlightening.

The error is free(): invalid next size (normal), which triggers the SIGABRT.

You can see that it is truncating: out.txt

See cmd.sh and env.sh: cmd.sh.txt env.sh.txt

Running the above by hand also triggers the SIGABRT.
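A minimal sketch of such a manual reproduction, assuming env.sh only sets environment variables and cmd.sh holds the failing command (the contents of both files are not shown in this thread). The helper name describe_exit and the log file repro.log are hypothetical. By shell convention, an exit status above 128 means the process was killed by a signal, so SIGABRT (signal 6) shows up as status 134.

```shell
# Translate a shell exit status into a readable description.
# By convention, a status above 128 means "killed by signal (status - 128)".
describe_exit() {
    status="$1"
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        echo "killed by signal ${sig} ($(kill -l "$sig"))"
    else
        echo "exited normally with status ${status}"
    fi
}

# Assumption: env.sh sets the environment and cmd.sh contains the failing command.
if [ -f env.sh ] && [ -f cmd.sh ]; then
    # The subshell keeps the dumped environment out of the interactive session.
    ( . ./env.sh && bash ./cmd.sh ) > repro.log 2>&1
    describe_exit "$?"
fi
```

Running the command in a subshell makes it easy to repeat the test after editing env.sh, for example after commenting out individual variables.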

The environment is Lmod with Tcl modulefiles on Rocky 8.10. If I comment out the _ModuleTable variables, the command runs, but that wipes the module table, so the output is no longer correct. The base binutils 2.42 built against the system toolchain without issue.

Any suggestions for how I might fix this?

Xaraxia avatar Jun 17 '25 07:06 Xaraxia

@Xaraxia That's an intriguing problem for sure!

Can you share some information on which version of Lmod you're using (module --version output), how it's configured (module --config output), and how Lmod was installed?

Which Lua and Tcl versions you have may also be relevant here.
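The requested details could be collected in one pass with something like the sketch below. The function names report and collect_versions are hypothetical, and the rpm query assumes an RPM-based system with packages named Lmod, lua, and tcl, as on Rocky 8. Each probe is guarded, so a missing tool is reported instead of stopping the script.

```shell
# Run one probe and label its output; report failure instead of aborting.
report() {
    label="$1"; shift
    echo "== ${label}"
    "$@" 2>&1 || echo "(command failed or not available)"
}

collect_versions() {
    # `module` is a shell function, so it exists only if Lmod is initialised here.
    if type module >/dev/null 2>&1; then
        report "Lmod version" module --version
    else
        echo "== Lmod version"
        echo "(module command not defined in this shell)"
    fi
    report "Lua version" lua -v
    # tclsh has no version flag; ask the interpreter for its patch level.
    report "Tcl patch level" sh -c 'echo "puts [info patchlevel]" | tclsh'
    # Assumption: RPM-based system with these package names.
    report "RPM packages" rpm -q Lmod lua tcl
}

collect_versions
```

The output of module --config is long, so it is probably better saved to a file separately and attached.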

It seems to me that this is really a problem with your Lmod installation (or Lua, or Tcl, ...), but we can try to help figure it out. It may be useful to get some input from the Lmod maintainers on this (@rtmclay and @mrcawood)...

boegel avatar Jun 18 '25 07:06 boegel

Modules based on Lua: Version 8.7.7 2022-07-05 10:00 -05:00 by Robert McLay

module--config.txt

Installed with standard Rocky 8.10 RPMs. Lmod-8.7.7-1.el8.x86_64

Lua 5.3.4 Copyright (C) 1994-2017 Lua.org, PUC-Rio

lua-5.3.4-12.el8.x86_64 tcl-8.6.8-2.el8.x86_64

The interesting thing is that apart from GPU support the images are the same - it builds on some and not others.

Also of interest: It successfully built in the past with EasyBuild 4.9.4. I'm in the process of doing a complete stack rebuild after upgrading to 5 - partly because we've noticed that a number of easyconfigs have been modified in the last few years and that has occasionally caused issues where we mix old builds with new easyconfigs, and partly because we have upgraded slurm and ucx and wanted a fresher OpenMPI build.

I have been overriding $MODULEPATH manually to force the new build, so I'm going to pull a node out of the cluster, edit the files, and try again. (Edit: that didn't help)

Xaraxia avatar Jun 25 '25 00:06 Xaraxia

Updating to Lmod 8.7.55-1 did not help.

Xaraxia avatar Jun 25 '25 01:06 Xaraxia

Please update to the latest version of Lmod (8.7.61). There have been recent changes to the Tcl interface. Try that first. If that doesn't work, then try setting LMOD_FAST_TCL_INTERP to no:

export LMOD_FAST_TCL_INTERP=no

before building binutils.

rtmclay avatar Jun 25 '25 17:06 rtmclay

I grabbed the specfile from the RHEL/Fedora RPM and built Lmod 8.7.62. After installing that, binutils built. (The export listed above was not required.) Thank you @rtmclay for your efforts.
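A quick way to check other nodes against this result is sketched below. It assumes that Lmod exports LMOD_VERSION in initialised shells and that sort supports -V (version sort, as in GNU coreutils). The function name version_ge is hypothetical. Note that 8.7.62 is only the first version reported to work in this thread, not a confirmed fix version.

```shell
# Return success when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort), as provided by GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# First version reported to work in this thread; not a confirmed fix version.
known_good="8.7.62"
# Assumption: Lmod exports LMOD_VERSION in initialised shells.
current="${LMOD_VERSION:-}"

if [ -z "$current" ]; then
    echo "LMOD_VERSION is not set; is Lmod initialised in this shell?"
elif version_ge "$current" "$known_good"; then
    echo "Lmod ${current} is at least ${known_good}"
else
    echo "Lmod ${current} is older than ${known_good}; consider upgrading"
fi
```

A plain string comparison would rank 8.7.7 above 8.7.62, which is why the sketch uses version sort.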

Should I assume that Tcl support for Lmod in EasyBuild 5 requires a newer Lmod version than is currently specified? If so, that's an update that needs to be made to the EasyBuild documentation.

I'll update our HPC image with the new Lmod build. For me, this is what I needed, thank you.

Xaraxia avatar Jun 27 '25 00:06 Xaraxia

@Xaraxia It seems like you were hitting a bug in Lmod.

EasyBuild only requires a particular version of Lmod because we rely on certain features of it, not because we know it's 100% bug free.

Unless @rtmclay can point out a particular Lmod version in which this problem was fixed, I don't think there's much we can do (except for documenting the problem, as we're doing here).

boegel avatar Jul 02 '25 13:07 boegel