
EESSI bash initialization to module file

Open TopRichard opened this issue 1 year ago • 14 comments

This is a follow-up PR to the issue: https://gitlab.com/eessi/support/-/issues/83

TopRichard avatar Aug 12 '24 07:08 TopRichard

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi.io-2023.06-compat, eessi-hpc.org-2023.06-software, eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software

eessi-bot[bot] avatar Aug 12 '24 07:08 eessi-bot[bot]

Instance boegel-bot-deucalion is configured to build for:

  • architectures: aarch64/a64fx
  • repositories: eessi.io-2023.06-software

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi.io-2023.06-compat, eessi-hpc.org-2023.06-compat, eessi-hpc.org-2023.06-software, eessi.io-2023.06-software

eessi-bot[bot] avatar Aug 12 '24 07:08 eessi-bot[bot]

I'm not sure if it is really a concern, but we add an Lmod family() related to the EESSI version as well so that we have a way to manage "compatible" modules (for example module files related to dev.eessi.io)

ocaisa avatar Aug 12 '24 08:08 ocaisa

This also doesn't cover the cURL issue currently fixed in our initialisation script:

    rhel_libcurl_file="/etc/pki/tls/certs/ca-bundle.crt"
    if [ -f $rhel_libcurl_file ]; then
      show_msg "Found libcurl CAs file at RHEL location, setting CURL_CA_BUNDLE"
      export CURL_CA_BUNDLE=$rhel_libcurl_file
    fi

You can use isFile() to give you the same logic

ocaisa avatar Aug 12 '24 08:08 ocaisa

The one other thing missing is the current redirection we do for Zen4:

https://github.com/EESSI/software-layer/blob/1fc0cb75a076a37184b9abddba2093118bdcfca6/init/eessi_environment_variables#L51-L59

ocaisa avatar Aug 12 '24 10:08 ocaisa

A quick test, including my suggested changes, shows that this is missing EESSI_CPU_FAMILY and EPREFIX. These seem to be the only ones of consequence.

ocaisa avatar Aug 12 '24 10:08 ocaisa

Here's the one that works for me (but still misses EPREFIX and EESSI_CPU_FAMILY and the override for Zen4) and includes the cert bundle check and PS1 change:

help([[
Description
===========
The European Environment for Scientific Software Installations (EESSI, pronounced as easy) is a collaboration between different European partners in the HPC community. The goal of this project is to build a common stack of scientific software installations for HPC systems and beyond, including laptops, personal workstations and cloud infrastructure.

More information
================
 - URL: https://www.eessi.io/docs/
]])
whatis("Description: The European Environment for Scientific Software Installations (EESSI, pronounced as easy) is a collaboration between different European partners in the HPC community. The goal of this project is to build a common stack of scientific software installations for HPC systems and beyond, including laptops, personal workstations and cloud infrastructure.")
whatis("URL: https://www.eessi.io/docs/")

local eessi_version = myModuleVersion()
local eessi_repo = "/cvmfs/software.eessi.io"
local eessi_prefix = pathJoin(eessi_repo, "versions", eessi_version)
local eessi_os_type = "linux"
pushenv("EESSI_VERSION", eessi_version)
pushenv("EESSI_CVMFS_REPO", eessi_repo)
pushenv("EESSI_OS_TYPE", eessi_os_type)
function archdetect_cpu()
    local script = pathJoin(eessi_prefix, 'init', 'lmod_eessi_archdetect_wrapper.sh')
    if not os.getenv("EESSI_ARCHDETECT_OPTIONS") then
        if convertToCanonical(LmodVersion()) < convertToCanonical("8.6") then
            LmodError("Loading this modulefile requires using Lmod version >= 8.6, but you can export EESSI_ARCHDETECT_OPTIONS to the available CPU architecture in the form of: x86_64/intel/haswell or aarch64/neoverse_v1")
        end
        source_sh("bash", script)
    end
    for archdetect_filter_cpu in string.gmatch(os.getenv("EESSI_ARCHDETECT_OPTIONS"), "([^" .. ":" .. "]+)") do
        if isDir(pathJoin(eessi_prefix, "software", eessi_os_type, archdetect_filter_cpu, "software")) then
            return archdetect_filter_cpu
        end
    end
    LmodError("Software directory check for the detected architecture failed")
end
local archdetect = archdetect_cpu()
local eessi_cpu_family = archdetect:match("([^/]+)")
local eessi_software_subdir = os.getenv("EESSI_SOFTWARE_SUBDIR_OVERRIDE") or archdetect
local eessi_eprefix = pathJoin(eessi_prefix, "compat", eessi_os_type, eessi_cpu_family)
local eessi_software_path = pathJoin(eessi_prefix, "software", eessi_os_type, eessi_software_subdir)
local eessi_module_path = pathJoin(eessi_software_path, "modules", "all")
local eessi_site_module_path = string.gsub(eessi_module_path, "versions", "host_injections")
pushenv("EESSI_SITE_MODULEPATH", eessi_site_module_path)
pushenv("EESSI_SOFTWARE_SUBDIR", eessi_software_subdir)
pushenv("EESSI_PREFIX", eessi_prefix)
pushenv("EESSI_EPREFIX", eessi_eprefix)
prepend_path("PATH", pathJoin(eessi_eprefix, "bin"))
prepend_path("PATH", pathJoin(eessi_eprefix, "usr/bin"))
pushenv("EESSI_SOFTWARE_PATH", eessi_software_path)
pushenv("EESSI_MODULEPATH", eessi_module_path)
prepend_path("MODULEPATH", eessi_module_path)
prepend_path("MODULEPATH", eessi_site_module_path)
pushenv("LMOD_CONFIG_DIR", pathJoin(eessi_software_path, ".lmod"))
pushenv("LMOD_PACKAGE_PATH", pathJoin(eessi_software_path, ".lmod"))
-- update the prompt (unless overridden)
if not os.getenv("EESSI_RETAIN_PROMPT") then
  pushenv("PS1", "{EESSI " .. eessi_version .. "} " .. (os.getenv("PS1") or ""))
end
-- check for RHEL certificate location
local rhel_certificates = "/etc/pki/tls/certs/ca-bundle.crt"
if isFile(rhel_certificates) then
  pushenv("CURL_CA_BUNDLE", rhel_certificates)
end 
if mode() == "load" then
    LmodMessage("EESSI/" .. eessi_version .. " loaded successfully")
end

ocaisa avatar Aug 12 '24 11:08 ocaisa
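
For readers more at home in shell than in Lua, the selection loop in archdetect_cpu() above can be sketched roughly as follows. This is only an illustration and not part of the PR: the helper name select_cpu_subdir and its three arguments are hypothetical, and it assumes the same input as the Lua code, namely a colon-separated list of CPU targets in which the first entry with an existing software directory wins.

```shell
# Hypothetical helper (not part of the PR): print the first entry of a
# colon-separated list of CPU targets for which
# <prefix>/software/<os_type>/<target>/software exists.
# The body runs in a subshell, so the IFS and globbing changes stay local.
select_cpu_subdir() (
    prefix="$1"     # e.g. /cvmfs/software.eessi.io/versions/2023.06
    os_type="$2"    # e.g. linux
    options="$3"    # e.g. x86_64/intel/haswell:x86_64/generic

    IFS=":"         # split the unquoted expansion below on colons
    set -f          # do not glob-expand the list entries
    for target in $options; do
        # Skip empty fields, as the Lua pattern [^:]+ does.
        if [ -n "$target" ] && [ -d "$prefix/software/$os_type/$target/software" ]; then
            printf '%s\n' "$target"
            exit 0
        fi
    done
    echo "Software directory check for the detected architecture failed" >&2
    exit 1
)

# Example call (paths are illustrative):
#   select_cpu_subdir /cvmfs/software.eessi.io/versions/2023.06 linux \
#       "x86_64/intel/haswell:x86_64/generic"
```

As in the module file, the function reports an error when none of the candidates has a matching directory.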

Setting only PS1 would cause problems for prompts that don't use the PS1 variable (Starship, OhMyPosh, etc.). A simple solution could be the following:

  if not os.getenv("EESSI_RETAIN_PROMPT") then
-   pushenv("PS1", "{EESSI " .. eessi_version .. "} " .. (os.getenv("PS1") or ""))
+   local eessi_prompt = "{EESSI " .. eessi_version .. "}"
+   pushenv("PS1", eessi_prompt .. " " .. (os.getenv("PS1") or ""))
+   pushenv("EESSI_PROMPT", eessi_prompt)
  end

However, os.getenv("PS1") still returns nil, so the original prompt is lost. I have no idea why Lua cannot read the variable, since the bash script has no issues with it.

MaKaNu avatar Aug 12 '24 11:08 MaKaNu

> The one other thing missing is the current redirection we do for Zen4:
>
> https://github.com/EESSI/software-layer/blob/1fc0cb75a076a37184b9abddba2093118bdcfca6/init/eessi_environment_variables#L51-L59

Handled in the commit above

TopRichard avatar Aug 12 '24 13:08 TopRichard

> Setting only PS1 would cause problems for prompts that don't use the PS1 variable (Starship, OhMyPosh, etc.). A simple solution could be the following:
>
>   if not os.getenv("EESSI_RETAIN_PROMPT") then
> -   pushenv("PS1", "{EESSI " .. eessi_version .. "} " .. (os.getenv("PS1") or ""))
> +   local eessi_prompt = "{EESSI " .. eessi_version .. "}"
> +   pushenv("PS1", eessi_prompt .. " " .. (os.getenv("PS1") or ""))
> +   pushenv("EESSI_PROMPT", eessi_prompt)
>   end
>
> However, os.getenv("PS1") still returns nil, so the original prompt is lost. I have no idea why Lua cannot read the variable, since the bash script has no issues with it.

We can only set PS1 if it is in the environment already

if os.getenv("PS1") and not os.getenv("EESSI_RETAIN_PROMPT") then

os.getenv() returns nil if a variable is not set

ocaisa avatar Aug 12 '24 13:08 ocaisa

> We can only set PS1 if it is in the environment already
>
> if os.getenv("PS1") and not os.getenv("EESSI_RETAIN_PROMPT") then

That's the thing I don't get. In my test scenario, PS1 was set and was also readable by the bash script, but os.getenv("PS1") returns nil.

I believe the problem is that the PS1 variable can't be exported, so it's not available in the module. Or at least the basic Ubuntu PS1=${debian_chroot:+($debian_chroot)}\u@\h:\w\$ seems to be an issue. If I provide a custom PS1, the module appends as expected.

EDIT: The issue is that directly exporting doesn't work; if I put export PS1=$PS1 in front of module load, it works as expected.

MaKaNu avatar Aug 12 '24 13:08 MaKaNu
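
The behaviour described in the comment above is consistent with how environment inheritance works: Lmod evaluates the module file in a separate process, and a child process only sees variables that the shell has exported. A minimal sketch of that effect, using the hypothetical variable name DEMO_PROMPT as a stand-in for PS1 and a plain `sh -c` child in place of Lmod:

```shell
# DEMO_PROMPT is a hypothetical stand-in for PS1; start from a clean state.
unset DEMO_PROMPT
DEMO_PROMPT='\u@\h:\w\$ '

# Set but not exported: a child process cannot see the variable.
before=$(sh -c 'printf "%s\n" "${DEMO_PROMPT-<unset>}"')

# Once exported, the same kind of child process inherits it.
export DEMO_PROMPT
after=$(sh -c 'printf "%s\n" "${DEMO_PROMPT-<unset>}"')

printf 'before export: %s\n' "$before"
printf 'after export:  %s\n' "$after"
```

This is why `export PS1=$PS1` before `module load` makes the value visible to os.getenv("PS1").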

Might it be possible that your jobs already fail at test "Test for archdetect_cpu functionality with only one valid path" and not, as indicated, at test "Test for archdetect_cpu functionality with invalid path"?

/home/runner/work/_temp/c11c9c6f-66bb-4412-991a-68d31eb8184c.sh: line 4: module: command not found

This causes the if-statement to fail, but the job task succeeds.

I can't find the task that sources the Lmod init script.

MaKaNu avatar Aug 16 '24 06:08 MaKaNu
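
One way to keep a test step from passing silently in this situation is to check that the module command is defined before any test runs. A minimal sketch, assuming a POSIX shell; require_module_command is a hypothetical helper name and is not something provided by the PR or by Lmod:

```shell
# Hypothetical guard: fail early when Lmod has not been initialised.
require_module_command() {
    # `command -v` succeeds for shell functions as well as executables;
    # Lmod defines `module` as a shell function once its init script is sourced.
    if ! command -v module >/dev/null 2>&1; then
        echo "module: command not found; initialise Lmod before running the tests" >&2
        return 1
    fi
}

# Intended use at the top of a test script:
#   require_module_command || exit 1
```

With such a guard the step fails with a clear message instead of letting a later if-statement take the wrong branch.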

@TopRichard, @MaKaNu is right, you don't have an initialised Lmod when you run the tests. You'll need to add another step that does that, see https://github.com/EESSI/software-layer/blob/b14cef843a13e51c2871fd9a4d8e4d7e74cb349a/init/bash from @MaKaNu's PR

ocaisa avatar Aug 16 '24 07:08 ocaisa

Could it be that:

      - name: Check out software-layer repository
        uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1

      - name: Mount EESSI CernVM-FS pilot repository
        uses: cvmfs-contrib/github-action-cvmfs@55899ca74cf78ab874bdf47f5a804e47c198743c # v4.0
        with:
          cvmfs_config_package: https://github.com/EESSI/filesystem-layer/releases/download/latest/cvmfs-config-eessi_latest_all.deb
          cvmfs_http_proxy: DIRECT
          cvmfs_repositories: software.eessi.io

loads the release version of EESSI, so your changes to the init system are not available?

MaKaNu avatar Aug 26 '24 11:08 MaKaNu

> Could it be that:
>
>       - name: Check out software-layer repository
>         uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
>
>       - name: Mount EESSI CernVM-FS pilot repository
>         uses: cvmfs-contrib/github-action-cvmfs@55899ca74cf78ab874bdf47f5a804e47c198743c # v4.0
>         with:
>           cvmfs_config_package: https://github.com/EESSI/filesystem-layer/releases/download/latest/cvmfs-config-eessi_latest_all.deb
>           cvmfs_http_proxy: DIRECT
>           cvmfs_repositories: software.eessi.io
>
> loads the release version of EESSI, so your changes to the init system are not available?

Yes, exactly, so the module EESSI/2023.06 is not available. This should work out after merging the PR.

TopRichard avatar Aug 26 '24 11:08 TopRichard

Looks like the test is still failing. Maybe show the result file to verify if the desired string is included?

trz42 avatar Aug 26 '24 12:08 trz42

> Looks like the test is still failing. Maybe show the result file to verify if the desired string is included?

Scratch that. The module file to be loaded will only be added by this PR. Hence it will never be able to load it from CVMFS before the module file has been ingested.

trz42 avatar Aug 26 '24 12:08 trz42

bot: build repo:eessi.io-2023.06-software arch:x86_64/generic

TopRichard avatar Aug 26 '24 12:08 TopRichard

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/generic from TopRichard

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/generic
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/generic resulted in:

    • submitted job 17041, for details & status see https://github.com/EESSI/software-layer/pull/667#issuecomment-2310154569

eessi-bot[bot] avatar Aug 26 '24 12:08 eessi-bot[bot]

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • account TopRichard has NO permission to send commands to the bot

eessi-bot[bot] avatar Aug 26 '24 12:08 eessi-bot[bot]

Updates by the bot instance boegel-bot-deucalion (click for details)
  • account TopRichard has NO permission to send commands to the bot

New job on instance eessi-bot-mc-aws for architecture x86_64-generic for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.08/pr_667/17041

date | job status | comment
Aug 26 12:59:06 UTC 2024 | submitted | job id 17041 awaits release by job manager
Aug 26 12:59:09 UTC 2024 | released | job awaits launch by Slurm scheduler
Aug 26 13:05:20 UTC 2024 | running | job 17041 is running
Aug 26 13:24:07 UTC 2024 | finished | :grin: SUCCESS (click triangle for details)
Details
:white_check_mark: job output file slurm-17041.out
:white_check_mark: no message matching ERROR:
:white_check_mark: no message matching FAILED:
:white_check_mark: no message matching required modules missing:
:white_check_mark: found message(s) matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-generic-1724677474.tar.gz
size: 0 MiB (1469 bytes)
entries: 1
modules under 2023.06/software/linux/x86_64/generic/modules/all
  no module files in tarball
software under 2023.06/software/linux/x86_64/generic/software
  no software packages in tarball
other under 2023.06/software/linux/x86_64/generic
  2023.06/init/modules/EESSI/2023.06.lua
Aug 26 13:24:07 UTC 2024 | test result | :grin: SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 18/18 test case(s) from 18 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-17041.out
:white_check_mark: no message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case

eessi-bot[bot] avatar Aug 26 '24 12:08 eessi-bot[bot]

@trz42 the module file is included in the tarball.

TopRichard avatar Aug 26 '24 17:08 TopRichard
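
A claim like this can be checked directly against the artefact by listing the tarball. The sketch below assumes a local copy of the tarball is available; tarball_contains is a hypothetical helper name, and the exact-match comparison assumes entries are stored without a leading "./", as in the bot's listing above.

```shell
# Hypothetical helper: succeed if the gzip-compressed tarball lists the entry.
tarball_contains() {
    tarball="$1"   # path to a .tar.gz file
    entry="$2"     # exact member path to look for
    # tar -tzf lists the members; grep -F -x -q matches one whole line literally.
    tar -tzf "$tarball" | grep -Fxq -- "$entry"
}

# Example call (file names taken from the bot report above):
#   tarball_contains eessi-2023.06-software-linux-x86_64-generic-1724677474.tar.gz \
#       2023.06/init/modules/EESSI/2023.06.lua
```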

>> Looks like the test is still failing. Maybe show the result file to verify if the desired string is included?
>
> Scratch that. The module file to be loaded will only be added by this PR. Hence it will never be able to load it from CVMFS before the module file has been ingested.

You can't load it from CVMFS, but you can

module use init/modules

and perform the tests

ocaisa avatar Aug 26 '24 18:08 ocaisa
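
The suggestion above could be turned into a test step along the following lines. This is a sketch under stated assumptions: Lmod is already initialised in the shell, the commands run from a checkout that contains init/modules, and check_local_eessi_module is a hypothetical helper name. It checks EESSI_VERSION afterwards (the module file sets it via pushenv) rather than relying only on the exit status of module load.

```shell
# Hypothetical helper: load the EESSI module from a local directory and
# verify that it set EESSI_VERSION to the expected value.
check_local_eessi_module() {
    module_dir="$1"      # e.g. init/modules in a checkout of the PR branch
    eessi_version="$2"   # e.g. 2023.06

    module use "$module_dir" || return 1
    module load "EESSI/$eessi_version" || return 1

    if [ "${EESSI_VERSION-}" != "$eessi_version" ]; then
        echo "EESSI_VERSION is '${EESSI_VERSION-}', expected '$eessi_version'" >&2
        return 1
    fi
    echo "EESSI/$eessi_version loaded from $module_dir"
}

# Example call:
#   check_local_eessi_module init/modules 2023.06 || exit 1
```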

>>> Looks like the test is still failing. Maybe show the result file to verify if the desired string is included?
>>
>> Scratch that. The module file to be loaded will only be added by this PR. Hence it will never be able to load it from CVMFS before the module file has been ingested.
>
> You can't load it from CVMFS, but you can
>
> module use init/modules
>
> and perform the tests

Agreed. This sounds like the correct way to test the module file. If it succeeds we can ingest it. If it fails (now or a future changed version), we must not ingest it.

trz42 avatar Aug 26 '24 20:08 trz42

Ok, I think this is pretty much there. One thing I wonder is whether we should be deploying this to a higher-level directory. The module file is already version-specific, so there is no real requirement to deploy it under the versions directory; it could be deployed under

/cvmfs/software.eessi.io/init

and that would make it much easier to have multiple EESSI versions available at once

ocaisa avatar Aug 27 '24 08:08 ocaisa

bot: build repo:eessi.io-2023.06-software arch:x86_64/generic

TopRichard avatar Aug 27 '24 12:08 TopRichard

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/generic from TopRichard

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/generic
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/generic resulted in:

    • submitted job 17103, for details & status see https://github.com/EESSI/software-layer/pull/667#issuecomment-2312461672

eessi-bot[bot] avatar Aug 27 '24 12:08 eessi-bot[bot]

Updates by the bot instance boegel-bot-deucalion (click for details)
  • account TopRichard has NO permission to send commands to the bot