easybuild-framework

run python in the same process as `eb` wrapper script by using `exec`

Open Flamefire opened this issue 1 year ago • 3 comments

Use `exec` in the `eb` wrapper script to avoid creating a new process. This makes it easier to work with e.g. SLURM: signals sent to the main process (`eb`) may not be forwarded to EasyBuild (Python), which results in e.g. stale locks when the process is later force-killed.
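To illustrate why `exec` helps here, the following is a minimal sketch (not the actual `eb` wrapper, which is a shell script) that uses Python's `os.execv` as a stand-in for the shell's `exec`. It assumes a POSIX system; the name `pids_before_and_after_exec` is made up for this example. The point is that the PID stays the same across `exec`, so a signal sent to the wrapper's PID lands directly on the Python process.

```python
import os
import subprocess
import sys

# Stand-in for the wrapper: print our PID, then replace this process image
# with a fresh Python interpreter that prints its own PID. No new process
# is created by execv, so both numbers should match on POSIX systems.
WRAPPER = r"""
import os, sys
print(os.getpid(), flush=True)  # flush: buffered output is lost on exec
os.execv(sys.executable, [sys.executable, "-c", "import os; print(os.getpid())"])
"""


def pids_before_and_after_exec():
    """Run the stand-in wrapper and return (pid_before_exec, pid_after_exec)."""
    out = subprocess.run(
        [sys.executable, "-c", WRAPPER],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    return int(out[0]), int(out[1])


if __name__ == "__main__":
    before, after = pids_before_and_after_exec()
    print("same process:", before == after)
```

Without `exec`, the wrapper would stay alive as a parent and would have to forward signals itself.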

Flamefire, Aug 02 '22 10:08

Rebased

Flamefire, Aug 04 '22 06:08

This change looks harmless, but when playing with it on our systems I actually ran into problems because of it... It's not entirely clear to me what's going wrong here, but it has something to do with the environment in which EasyBuild is being run not being exactly the same...

== Temporary log file in case of crash /tmp/eb-ypv4xy3v/easybuild-12uyhfz0.log
ERROR: Failed to process easyconfig /tmp/CFITSIO-4.1.0-GCCcore-11.3.0.eb: Module command '/usr/share/lmod/lmod/libexec/lmod python --terse --show-hidden avail ' failed with exit code 1; stderr: no such variable
    (read trace on "env(_)")
    invoked from within
"string match "*tcl2lua.tcl" $env(_)"
    (file "/etc/modulefiles/vsc/cluster/.modulerc" line 4)
    invoked from within
"source $mRcFile"
    (procedure "main" line 15)
    invoked from within
"main $fn"
    (file "/usr/share/lmod/lmod/libexec/RC2lua.tcl" line 137)
Lmod has detected the following error: Unable to parse: "/etc/modulefiles/vsc/cluster/.modulerc". Aborting!

If you don't understand the warning or error, contact the helpdesk at [email protected]


; stdout: _mlstatus = False

Here's the contents of the .modulerc file:

#%Module1.0
# Legacy modulerc. Lmod should always take default.lua first. It's only here to ensure
# falling back to environment-modules keeps on working.
if { ![string match "*tcl2lua.tcl" $env(_)] } {
    if {[info exists ::env(VSC_DEFAULT_CLUSTER_MODULE)]} {
        module-version cluster/$::env(VSC_DEFAULT_CLUSTER_MODULE) default
    } else {
        puts stderr "The default cluster module cannot be determined. Please set \$VSC_DEFAULT_CLUSTER_MODULE."
        exit 1
    }
}

The $env(_) part is where it tries to figure out how this file is being processed (via the tcl2lua script in Lmod, or not), so somehow the value of the $_ environment variable is different with and without using exec in the eb wrapper script?

boegel, Aug 05 '22 15:08

> The $env(_) part is where it tries to figure out how this file is being processed (via the tcl2lua script in Lmod, or not), so somehow the value of the $_ environment variable is different with and without using exec in the eb wrapper script?

In fact, it is not set at all, as tested via a Python script printing `os.environ`. That variable is a bit special: https://unix.stackexchange.com/questions/280453/understand-the-meaning-of

The actual bug in your combination of config, Lmod, this change, etc. is that `/usr/share/lmod/lmod/libexec/lmod python --terse --show-hidden avail`, as run by EB, isn't run in a shell, and hence `$_` doesn't get set for it. So the value inherited from EB is used, which isn't there anymore either after this change (`exec` replaces the process, so there is no "last command", I guess).
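The inheritance behaviour described above can be checked with a small sketch. It assumes a POSIX system; the helper name `underscore_in_child` and the path `/some/previous/command` are made up for illustration. A child started directly (no shell) only sees `_` if its parent's environment already had it, whereas bash exports `_` to every command it launches.

```python
import os
import subprocess
import sys

# One-liner run in the child: report the value of `_`, or "None" if unset.
PRINT_UNDERSCORE = "import os; print(os.environ.get('_'))"


def underscore_in_child(env, via_bash=False):
    """Return the value of `_` as seen by a child Python process."""
    cmd = [sys.executable, "-c", PRINT_UNDERSCORE]
    if via_bash:
        # Let bash start the child; bash puts `_` (the path of the command
        # it is about to run) into that command's environment.
        cmd = ["bash", "-c", '"$0" -c "$1"', sys.executable, PRINT_UNDERSCORE]
    result = subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)
    return result.stdout.strip()


if __name__ == "__main__":
    # Start from the current environment with `_` removed.
    clean = {k: v for k, v in os.environ.items() if k != "_"}
    print("no shell, `_` unset   :", underscore_in_child(clean))
    print("no shell, `_` present :",
          underscore_in_child(dict(clean, _="/some/previous/command")))
    print("via bash              :", underscore_in_child(clean, via_bash=True))
```

The first case corresponds to what Lmod sees when it is started without a shell from a process that itself has no `_`.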

I added a commit which re-adds this variable in main. Hope that helps.
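The commit itself is not shown in this thread, so the following is only a sketch of what re-adding the variable could look like, not the actual change. The helper name `ensure_underscore_env` and the choice of `sys.argv[0]` as the fallback value are assumptions made for this example.

```python
import os
import sys


def ensure_underscore_env(environ=os.environ, fallback=None):
    """Hypothetical helper: make sure `_` is defined for child processes.

    An existing value is left untouched; otherwise `_` is set to `fallback`,
    or to the path of the running script if no fallback is given.
    """
    if "_" not in environ:
        environ["_"] = fallback if fallback is not None else sys.argv[0]
    return environ["_"]


if __name__ == "__main__":
    # Called early in main, before any module command is launched.
    print("`_` is now:", ensure_underscore_env())
```

Leaving an existing value alone keeps the behaviour unchanged for setups where the shell already provided `_`.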

Flamefire, Aug 15 '22 09:08