Bot-specific `SitePackage.lua` that solves `libfabric` issues
With help from @casparvl, I've added the following to /project/def-users/bot/shared/host-injections/2023.06/.lmod/SitePackage.lua on our AWS build cluster, which will be picked up by the bot for builds relying on libfabric:
require("strict")
local hook = require("Hook")

-- LmodMessage("Load bot-specific SitePackage.lua")

local function eessi_bot_libfabric_set_psm3_devices_hook(t)
    local simpleName = string.match(t.modFullName, "(.-)/")
    -- we may want to be more specific in the future, and only do this for specific versions of libfabric
    if simpleName == 'libfabric' then
        -- set environment variable PSM3_DEVICES as workaround for MPI applications hanging in libfabric's PSM3 provider
        -- cf. https://github.com/easybuilders/easybuild-easyconfigs/issues/18925
        setenv('PSM3_DEVICES', 'self,shm')
    end
end

-- combine all load hook functions into a single one
function site_specific_load_hook(t)
    eessi_bot_libfabric_set_psm3_devices_hook(t)
end

local function combined_load_hook(t)
    -- Assuming this was called from EESSI's SitePackage.lua, this should be defined and thus run
    if eessi_load_hook ~= nil then
        eessi_load_hook(t)
    end
    site_specific_load_hook(t)
end

hook.register("load", combined_load_hook)
This solves the Haswell OpenMPI issues that we observed in several PRs. I was going to make a PR for it, but I have some doubts on how this should be done:
- does it have to be restricted to Haswell (we also saw some hangs with other architectures, but it's not entirely clear if they were caused by the same issue)?
- does it have to be restricted to certain versions of `libfabric`?
- do we also need this for the tests? Answer from @casparvl: yes, might be needed.
- which script should make sure that this `SitePackage.lua` is picked up / copied to the right location? `bot/build.sh`, `EESSI-install-software.sh`, `eessi_container.sh`, ...?
- what if a PR wants to update `SitePackage.lua`: should the build for that PR already pick up the new version? If so, we should probably prevent it from being copied to the shared directory, otherwise other builds will also pick it up before it's merged.
- I wouldn't restrict it to only Haswell on our build cluster, since `libfabric` is essentially irrelevant there (at runtime).
- We could restrict it to specific versions of `libfabric` (since it seems to be a bug there?)
- We may also need it for the test suite, yes, but then I would deal with that in the test suite repo?
- I would only put the hook in place during the `build` phase, so `bot/build.sh`
- If the `SitePackage.lua` is put in place via `bot/build.sh`, then changes to it should only get picked up by the PR, and should be isolated to that PR?
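
If we do go for restricting the workaround to specific `libfabric` versions, the check could look roughly like the sketch below. This is only an illustration: the helper names (`split_module_name`, `needs_psm3_workaround`) are made up, and the version listed in `affected_libfabric_versions` is a placeholder, since it hasn't been established here which versions are actually affected. It also assumes module names of the form `name/version-toolchain`.

```lua
-- Hypothetical helper: split "libfabric/1.18.0-GCCcore-12.3.0" into
-- "libfabric" and "1.18.0-GCCcore-12.3.0" (returns nil, nil if there is no "/").
local function split_module_name(mod_full_name)
    return string.match(mod_full_name, "^(.-)/(.+)$")
end

-- Placeholder list: which versions are really affected is an open question.
local affected_libfabric_versions = {
    ["1.18.0"] = true,
}

-- Returns true only for libfabric modules whose bare version is in the list above.
local function needs_psm3_workaround(mod_full_name)
    local name, version = split_module_name(mod_full_name)
    if name ~= "libfabric" or version == nil then
        return false
    end
    -- drop the toolchain suffix, e.g. "1.18.0-GCCcore-12.3.0" -> "1.18.0"
    local bare_version = string.match(version, "^([^-]+)")
    return affected_libfabric_versions[bare_version] == true
end

-- Inside the load hook, the existing name check would then become:
--   if needs_psm3_workaround(t.modFullName) then
--       setenv('PSM3_DEVICES', 'self,shm')
--   end
```

Keeping the version list in a table means adding or removing an affected version is a one-line change, and everything is `local`, so it should also be fine under `require("strict")`.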
The same approach could be used for other problems that are triggered via libfabric, see https://github.com/easybuilders/easybuild-easyconfigs/issues/20233
@TopRichard also found an issue with our CUDA hook when trying to use it on NESSI: it currently forbids loading dependency modules that have GPU support, even for building purposes. Disabling that hook as part of the bot-specific `SitePackage.lua` seems like a good idea.
To fix a similar kind of MPI issue on our zen4 cluster (see https://github.com/EESSI/software-layer/pull/815), I added the following file to the bot account:
$ cat /project/def-users/bot/shared/host-injections/2023.06/.lmod/SitePackage.lua
require("strict")
local hook = require("Hook")

-- LmodMessage("Load bot-specific SitePackage.lua")

local function eessi_bot_libfabric_set_psm3_devices_hook(t)
    local simpleName = string.match(t.modFullName, "(.-)/")
    -- we may want to be more specific in the future, and only do this for specific versions of libfabric
    if simpleName == 'libfabric' then
        -- set environment variable FI_PROVIDER as workaround for MPI applications hanging in libfabric's PSM3 provider
        -- cf. https://github.com/easybuilders/easybuild-easyconfigs/issues/18925
        setenv('FI_PROVIDER', '^psm3')
    end
end

-- combine all load hook functions into a single one
function site_specific_load_hook(t)
    eessi_bot_libfabric_set_psm3_devices_hook(t)
end

local function combined_load_hook(t)
    -- Assuming this was called from EESSI's SitePackage.lua, this should be defined and thus run
    if eessi_load_hook ~= nil then
        eessi_load_hook(t)
    end
    site_specific_load_hook(t)
end

hook.register("load", combined_load_hook)
We haven't actually implemented this fix, and ran into it again in #966
@bedroge The custom site package you mention is not actually there under `/project/def-users/bot/shared/host-injections/2023.06/.lmod`. I suspect it may have been forgotten when the cluster was rebuilt?
I've restored it as it was detailed in the first comment here.
I just ran into this again!