Multiversioning is order dependent?
I'm on an HPC system with a few different architectures.
The login node is skylake:
julia> versioninfo()
Julia Version 1.9.0
Commit 8e630552924 (2023-05-07 11:25 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 32 × Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)
Threads: 1 on 32 virtual cores
Environment:
LD_LIBRARY_PATH = /central/software/julia/1.9.0/lib:/central/software/CUDA/11.8/lib64:/central/software/CUDA/11.8/extras/CUPTI/lib64:/central/software/CUDA/11.8/targets/x86_64-linux/lib
LD_RUN_PATH = /central/software/CUDA/11.8/lib64:/central/software/CUDA/11.8/extras/CUPTI/lib64:/central/software/CUDA/11.8/targets/x86_64-linux/lib
and the compute node is broadwell:
julia> versioninfo()
Julia Version 1.9.0
Commit 8e630552924 (2023-05-07 11:25 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 28 × Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, broadwell)
Threads: 1 on 28 virtual cores
Environment:
LD_LIBRARY_PATH = /central/software/CUDA/11.8/lib64:/central/software/CUDA/11.8/extras/CUPTI/lib64:/central/software/CUDA/11.8/targets/x86_64-linux/lib:/central/software/julia/1.9.0/lib:/central/slurm/install/current/lib/
LD_RUN_PATH = /central/software/CUDA/11.8/lib64:/central/software/CUDA/11.8/extras/CUPTI/lib64:/central/software/CUDA/11.8/targets/x86_64-linux/lib
I'm calling Pkg.precompile() on the login node, then using CUDA on the compute node.
1. the default
If I don't set anything, loading CUDA on the compute node triggers precompilation again. With JULIA_DEBUG=all set, I get the following debug messages:
┌ Debug: Rejecting cache file /home/spjbyrne/.julia/compiled/v1.9/CUDA/oWw5k_OHRW8.ji for CUDA [052768ef-5323-5732-b1bb-66c8b64840ba] since pkgimage can't be loaded on this target
└ @ Base loading.jl:2706
┌ Debug: Precompiling CUDA [052768ef-5323-5732-b1bb-66c8b64840ba]
└ @ Base loading.jl:2140
(and similar for CUDA.jl's deps)
2. setting JULIA_CPU_TARGET=broadwell
Since broadwell is supported by both nodes, this seems to work as intended. I do get the following warning:
┌ Debug: Rejecting cache file /central/software/julia/1.9.0/share/julia/compiled/v1.9/Statistics/ERcPL_Stp2R.ji for Statistics [10745b16-79ce-11e8-11f9-7d13ad32a3b2] since the flags are mismatched
│ current session: use_pkgimages = true, debug_level = 1, check_bounds = 0, inline = true, opt_level = 2
│ cache file: use_pkgimages = true, debug_level = 1, check_bounds = 1, inline = true, opt_level = 2
└ @ Base loading.jl:2690
but it doesn't appear to cause any issues (perhaps since Statistics isn't built as a pkgimage?).
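As a rough sketch, option 2 in shell terms could look like the following. It assumes the variable is exported before precompiling on the login node; the `julia -e` invocation is illustrative, and the guard only skips it on machines where `julia` is not on the PATH.

```shell
# Target the oldest architecture that both node types support.
export JULIA_CPU_TARGET=broadwell

# On the login node: build the package caches for that target.
if command -v julia >/dev/null 2>&1; then
    julia -e 'using Pkg; Pkg.precompile()'
fi
```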
3. setting JULIA_CPU_TARGET='skylake;broadwell'
This does not appear to work, and gives the same behavior as 1:
┌ Debug: Rejecting cache file /home/spjbyrne/.julia/compiled/v1.9/CUDA/oWw5k_Qcjfa.ji for CUDA [052768ef-5323-5732-b1bb-66c8b64840ba] since pkgimage can't be loaded on this target
└ @ Base loading.jl:2706
┌ Debug: Precompiling CUDA [052768ef-5323-5732-b1bb-66c8b64840ba]
└ @ Base loading.jl:2140
(and similar for dependencies)
cc @vchuravy
So 2. is not an issue; that message is just from another cache file that was checked and rejected along the way.
For 3. Could you try: JULIA_CPU_TARGET='broadwell;skylake'?
One thing we discussed is to save the cpu_target string of the sysimg and use that as the default for pkgimages.
This would mitigate 1., but would increase cache-time.
@simonbyrne for you this would be identical to setting: generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1).
x-ref: https://github.com/JuliaCI/julia-buildkite/issues/298
For 3. Could you try: JULIA_CPU_TARGET='broadwell;skylake'?
Yes, that appears to work (in that it doesn't trigger recompilation).
this would be identical to setting: generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1).
That also works.
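For anyone setting these from a shell, here is a minimal sketch of the two values that worked above. The one point beyond what the thread says is the quoting: an unquoted `;` ends the shell command, so the value has to be quoted.

```shell
# Quote the value: an unquoted ';' would terminate the assignment.
# The baseline (oldest) target goes first, the newer target after it.
export JULIA_CPU_TARGET='broadwell;skylake'

# Alternative from the thread, the target list suggested as matching the sysimg:
# export JULIA_CPU_TARGET='generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1)'

echo "$JULIA_CPU_TARGET"
```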
So from https://docs.julialang.org/en/v1/devdocs/sysimg/#Specifying-multiple-system-image-targets
By default, only functions that are the most likely to benefit from the microarchitecture features will be cloned.
and
By default, a partially cloned (i.e. not clone_all) target will use functions from the default target (first one specified) if a function is not cloned.
For example, 'skylake;broadwell' takes skylake as the base image and may then compile only some functions for broadwell as an extension. Functions that are not cloned fall back to the skylake base,
which leads to an image that is not loadable on broadwell.
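To make the "first one specified" rule concrete, here is a small shell sketch that prints which entry of a target string would act as the default target. `default_target` is a hypothetical helper, not part of Julia; it only parses the string and does not ask Julia what it actually built.

```shell
# Hypothetical helper, not part of Julia: print the first (default) target
# name from a JULIA_CPU_TARGET-style string.
default_target() {
    first=${1%%;*}                 # keep the text before the first ';'
    printf '%s\n' "${first%%,*}"   # drop feature flags such as ',clone_all'
}

default_target 'skylake;broadwell'   # prints: skylake
default_target 'broadwell;skylake'   # prints: broadwell
```

With the first ordering the default target is skylake, which is the case that was rejected on the broadwell node; with the second it is broadwell, which both nodes support.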
@pchintalapudi raised the point offline that this is a non-ideal default and we probably should make clone_all the default.
This may be related to #54464 where we found that JULIA_CPU_TARGET basically only compiles for the first target (which would explain why it's order-dependent).
IIUC, the order-dependence in this issue is intentional. You can argue clone_all should be the default, but I think the current behavior makes sense. #54464 looks like a different issue.