avx512 features no longer detected in target images in v1.11
I use julia on a heterogeneous compute cluster whose nodes have different CPUs but share a single file system. This has caused problems before: in 1.10, precompilation is triggered each time a project is used on a different node. However, this was easily worked around by using separate projects. In 1.11 that workaround no longer seems to work.
For example, let's say I create two projects, env1 and env2. I load env1 on my workstation (Intel Xeon W-2223), add a package (say JLD2) and precompile. Then I load env2 on a compute node (AMD EPYC 7302) and add the same package. Despite the different CPU, no precompilation is triggered. Then, when I try to run some code, julia crashes on an invalid instruction:
julia> using JLD2
julia> jldsave("test.jld2", a=rand(100))
Invalid instruction at 0x1552538da346: 0x62, 0xf2, 0xfd, 0x28, 0x7c, 0xc0, 0xc4, 0xc1, 0x7e, 0x7f, 0x44, 0x24, 0x10, 0x4d, 0x89
[1482091] signal 4 (2): Illegal instruction
in expression starting at REPL[3]:1
MmapIO at /home/sschult/.julia/packages/JLD2/3zWRM/src/io/mmapio.jl:14 [inlined]
MmapIO at /home/sschult/.julia/packages/JLD2/3zWRM/src/io/mmapio.jl:113
openfile at /home/sschult/.julia/packages/JLD2/3zWRM/src/JLD2.jl:146 [inlined]
openfile at /home/sschult/.julia/packages/JLD2/3zWRM/src/JLD2.jl:151
#jldopen#22 at /home/sschult/.julia/packages/JLD2/3zWRM/src/JLD2.jl:215
jldopen at /home/sschult/.julia/packages/JLD2/3zWRM/src/JLD2.jl:164 [inlined]
#jldopen#23 at /home/sschult/.julia/packages/JLD2/3zWRM/src/JLD2.jl:286 [inlined]
jldopen at /home/sschult/.julia/packages/JLD2/3zWRM/src/JLD2.jl:279 [inlined]
#jldsave#107 at /home/sschult/.julia/packages/JLD2/3zWRM/src/loadsave.jl:286
jldsave at /home/sschult/.julia/packages/JLD2/3zWRM/src/loadsave.jl:283 [inlined]
jldsave at /home/sschult/.julia/packages/JLD2/3zWRM/src/loadsave.jl:283
unknown function (ip: 0x15525cb97c66)
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-master/src/julia.h:2157 [inlined]
do_call at /cache/build/builder-amdci5-1/julialang/julia-master/src/interpreter.c:126
eval_value at /cache/build/builder-amdci5-1/julialang/julia-master/src/interpreter.c:223
eval_stmt_value at /cache/build/builder-amdci5-1/julialang/julia-master/src/interpreter.c:174 [inlined]
eval_body at /cache/build/builder-amdci5-1/julialang/julia-master/src/interpreter.c:663
jl_interpret_toplevel_thunk at /cache/build/builder-amdci5-1/julialang/julia-master/src/interpreter.c:821
jl_toplevel_eval_flex at /cache/build/builder-amdci5-1/julialang/julia-master/src/toplevel.c:943
jl_toplevel_eval_flex at /cache/build/builder-amdci5-1/julialang/julia-master/src/toplevel.c:886
ijl_toplevel_eval_in at /cache/build/builder-amdci5-1/julialang/julia-master/src/toplevel.c:994
eval at ./boot.jl:430 [inlined]
eval_user_input at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:226
repl_backend_loop at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:323
#start_repl_backend#59 at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:308
start_repl_backend at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:305
#run_repl#72 at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:464
run_repl at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:450
jfptr_run_repl_10212 at /home/sschult/.julia/juliaup/julia-1.11.0+0.x64.linux.gnu/share/julia/compiled/v1.11/REPL/u0gqU_bFCI4.so (unknown line)
#1138 at ./client.jl:446
jfptr_YY.1138_14881 at /home/sschult/.julia/juliaup/julia-1.11.0+0.x64.linux.gnu/share/julia/compiled/v1.11/REPL/u0gqU_bFCI4.so (unknown line)
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-master/src/julia.h:2157 [inlined]
jl_f__call_latest at /cache/build/builder-amdci5-1/julialang/julia-master/src/builtins.c:875
#invokelatest#2 at ./essentials.jl:1054 [inlined]
invokelatest at ./essentials.jl:1051 [inlined]
run_main_repl at ./client.jl:430
repl_main at ./client.jl:567 [inlined]
_start at ./client.jl:541
jfptr__start_72051.1 at /home/sschult/.julia/juliaup/julia-1.11.0+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-master/src/julia.h:2157 [inlined]
true_main at /cache/build/builder-amdci5-1/julialang/julia-master/src/jlapi.c:900
jl_repl_entrypoint at /cache/build/builder-amdci5-1/julialang/julia-master/src/jlapi.c:1059
main at /cache/build/builder-amdci5-1/julialang/julia-master/cli/loader_exe.c:58
__libc_start_call_main at /lib64/libc.so.6 (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 1547120 (Pool: 1547037; Big: 83); GC: 3
Illegal instruction (core dumped)
The instruction in question appears to be vpbroadcastq from the AVX-512 instruction set, which, indeed, the Intel Xeon W-2223 supports and the AMD EPYC 7302 does not.
- This works without errors in 1.10.5.
- This problem is not specific to JLD2. I have obtained similar errors when using, for example, CairoMakie.jl, CUDA.jl or Arrow.jl, which error on other instructions from the same set.
- If I instead use env2 on another node with e.g. an Intel Xeon E5-2698 v3, which also does not support AVX-512, the behaviour is different: precompilation is triggered, and no error is thrown.
- If I delete .julia/compiled/v1.11 and first load env2 on the compute node, everything works fine until I use env1 on my workstation, after which the same error occurs on the compute node.
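As a rough cross-check of the AVX-512 diagnosis, the leading bytes from the crash message can be decoded by hand. The sketch below assumes the standard EVEX prefix layout (a 0x62 byte followed by three payload bytes) as I understand it from the Intel manuals; the variable names are mine, and it is not a general disassembler.

```julia
# First six bytes reported after "Invalid instruction at ..."
bytes = UInt8[0x62, 0xf2, 0xfd, 0x28, 0x7c, 0xc0]

# In 64-bit mode a leading 0x62 byte introduces an EVEX prefix (the AVX-512 encoding)
is_evex = bytes[1] == 0x62

p0, p1, p2 = bytes[2], bytes[3], bytes[4]
opcode_map = p0 & 0x03         # 0x02 selects the 0F 38 opcode map
w_bit      = (p1 >> 7) & 0x01  # 0x01 means 64-bit elements
simd_pfx   = p1 & 0x03         # 0x01 stands for the 66 prefix
vec_len    = (p2 >> 5) & 0x03  # 0x00 = 128-bit, 0x01 = 256-bit, 0x02 = 512-bit
opcode     = bytes[5]          # 0x7c

println((; is_evex, opcode_map, w_bit, simd_pfx, vec_len, opcode))
```

With these bytes the fields read as EVEX.256.66.0F38.W1 7C, which to my knowledge is the encoding of vpbroadcastq from a 64-bit general-purpose register into a ymm register. A 256-bit EVEX instruction needs AVX512VL in addition to AVX512F, neither of which the EPYC 7302 provides.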
This code shouldn't have been allowed to load. @vchuravy
For debugging this, before loading JLD2 do
ENV["JULIA_DEBUG"] = "loading"
using JLD2
You should see a line like
┌ Debug: Loading object cache file /depot/compiled/v1.11/JLD2/bla_blah.so for JLD2 [...]
└ @ Base loading.jl:1203
Then run (basically you need to replace the extension .so of the object cache file above with .ji):
Base.parse_image_targets(Base.parse_cache_header("/depot/compiled/v1.11/JLD2/bla_blah.ji")[7])
What do you get here?
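The extension swap described above can be scripted. This is only a sketch: ji_path is a hypothetical helper name, the path is the placeholder from the example debug line, and parse_cache_header / parse_image_targets are Base internals that may change between versions.

```julia
# Hypothetical helper: map an object cache file (.so) to its cache header file (.ji)
ji_path(so_path::AbstractString) = splitext(so_path)[1] * ".ji"

so = "/depot/compiled/v1.11/JLD2/bla_blah.so"  # placeholder path from the debug output
ji = ji_path(so)
println(ji)

# Only inspect the header if the file really exists on this machine.
if isfile(ji)
    println(Base.parse_image_targets(Base.parse_cache_header(ji)[7]))
end
```

splitext only looks at the last path component, so the dot in the v1.11 directory name does not interfere.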
julia> Base.parse_image_targets(Base.parse_cache_header("/home/sschult/.julia/compiled/v1.11/JLD2/O1EyT_NIQbS.ji")[7])
1-element Vector{Base.ImageTarget}:
cascadelake; flags=0; features_en=(sse3, pclmul, ssse3, fma, cx16, sse4.1, sse4.2, movbe, popcnt, aes, xsave, avx, f16c, fsgsbase, bmi, avx2, bmi2, adx, clflushopt, clwb, sahf, lzcnt, prfchw, xsavec, xsaves)
Ok, to be clear, you should do that test in the situation where you get the "wrong" JLD2 with the incompatible code. This image doesn't seem to have the avx512 feature (assuming JLD2 is indeed the offending package here).
That's exactly what I did. In fact, the same .so file is reported on both machines, and the output is exactly the same.
In 1.10, I get
julia> Base.parse_image_targets(Base.parse_cache_header("/home/sschult/.julia/compiled/v1.10/JLD2/O1EyT_NQjXZ.ji")[7])
1-element Vector{Base.ImageTarget}:
cascadelake; flags=0; features_en=(sse3, pclmul, ssse3, fma, cx16, sse4.1, sse4.2, movbe, popcnt, aes, xsave, avx, f16c, fsgsbase, bmi, avx2, bmi2, avx512f, avx512dq, adx, clflushopt, clwb, avx512cd, avx512bw, avx512vl, avx512vnni, sahf, lzcnt, prfchw, xsavec, xsaves)
which does include avx512, so I guess something causes this to be stored incorrectly in 1.11.
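To see exactly which features differ between the two outputs, the printed lines can be compared as sets. A minimal sketch, assuming the textual format shown above; the features function is a hypothetical helper that works on the printed string, not on Base.ImageTarget objects.

```julia
# Extract the feature names from one printed ImageTarget line
function features(line::AbstractString)
    m = match(r"features_en=\(([^)]*)\)", line)
    (m === nothing || isempty(m[1])) && return String[]
    return String.(split(m[1], ", "))
end

# Lines copied from the 1.10 and 1.11 outputs in this thread
v110 = "cascadelake; flags=0; features_en=(sse3, pclmul, ssse3, fma, cx16, sse4.1, sse4.2, movbe, popcnt, aes, xsave, avx, f16c, fsgsbase, bmi, avx2, bmi2, avx512f, avx512dq, adx, clflushopt, clwb, avx512cd, avx512bw, avx512vl, avx512vnni, sahf, lzcnt, prfchw, xsavec, xsaves)"
v111 = "cascadelake; flags=0; features_en=(sse3, pclmul, ssse3, fma, cx16, sse4.1, sse4.2, movbe, popcnt, aes, xsave, avx, f16c, fsgsbase, bmi, avx2, bmi2, adx, clflushopt, clwb, sahf, lzcnt, prfchw, xsavec, xsaves)"

missing_in_111 = setdiff(features(v110), features(v111))
extra_in_111   = setdiff(features(v111), features(v110))
println("only in 1.10: ", missing_in_111)
println("only in 1.11: ", extra_in_111)
```

For these two lines the difference is exactly the six avx512* entries, and nothing is present only in 1.11.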
After some further testing, I found that the issue first appears in 1.11.0-rc4, with rc3 unaffected. Also, in rc3, Base.current_image_targets() correctly contains the avx512 features, whereas in rc4 it does not:
:~> julia +1.11.0-rc3 -e "println(Base.current_image_targets())"
Base.ImageTarget[cascadelake; flags=0; features_en=(sse3, pclmul, ssse3, fma, cx16, sse4.1, sse4.2, movbe, popcnt, aes, xsave, avx, f16c, fsgsbase, bmi, avx2, bmi2, avx512f, avx512dq, adx, clflushopt, clwb, avx512cd, avx512bw, avx512vl, avx512vnni, sahf, lzcnt, prfchw, xsavec, xsaves)]
:~> julia +1.11.0-rc4 -e "println(Base.current_image_targets())"
Base.ImageTarget[cascadelake; flags=0; features_en=(sse3, pclmul, ssse3, fma, cx16, sse4.1, sse4.2, movbe, popcnt, aes, xsave, avx, f16c, fsgsbase, bmi, avx2, bmi2, adx, clflushopt, clwb, sahf, lzcnt, prfchw, xsavec, xsaves)]
That's very interesting. Would you be able to run git bisect? This is the diff: https://github.com/JuliaLang/julia/compare/v1.11.0-rc3...v1.11.0-rc4, there are only 35 commits between the two versions, but honestly at a quick glance I can't spot a change which would affect that.
I don't have the time to git bisect right now (I might be able to do it later), but I can confirm the avx512 features are also gone on skylake-avx512:
$ julia -E 'Base.current_image_targets()'
Base.ImageTarget[skylake-avx512; flags=0; features_en=(sse3, pclmul, ssse3, fma, cx16, sse4.1, sse4.2, movbe, popcnt, aes, xsave, avx, f16c, fsgsbase, bmi, avx2, bmi2, adx, clflushopt, clwb, pku, sahf, lzcnt, prfchw, xsavec, xsaves)]
From checking the automatic builds 50c1ea848579ddc99e3c3633b85669988c2c89f2 appears to be the first commit affected.
It'd be very surprising if that were the commit responsible; I can't see how it would affect feature detection: current_image_targets simply parses the output of the C function jl_reflect_clone_targets, which that commit doesn't touch: https://github.com/JuliaLang/julia/blob/1f935afc07edde9f8c2e1a0f05d4772e18a55e97/base/loading.jl#L1735-L1738
For the record, the issue seems to be solved on master (d36417b8230) and affects only v1.11. We need to find what fixed it (as well as what caused it):
$ julia +1.10 -E 'Base.current_image_targets()'
Base.ImageTarget[znver3; flags=0; features_en=(sse3, pclmul, ssse3, fma, cx16, sse4.1, sse4.2, movbe, popcnt, aes, xsave, avx, f16c, fsgsbase, bmi, avx2, bmi2, avx512f, avx512dq, adx, avx512ifma, clflushopt, clwb, avx512cd, sha, avx512bw, avx512vl, avx512vbmi, pku, avx512vbmi2, shstk, gfni, vaes, vpclmulqdq, avx512vnni, avx512bitalg, avx512vpopcntdq, rdpid, sahf, lzcnt, sse4a, prfchw, mwaitx, xsavec, xsaves, clzero, wbnoinvd, avx512bf16)]
$ julia +1.11 -E 'Base.current_image_targets()'
Base.ImageTarget[znver4; flags=0; features_en=(sse3, pclmul, ssse3, fma, cx16, sse4.1, sse4.2, movbe, popcnt, aes, xsave, avx, f16c, fsgsbase, bmi, avx2, bmi2, adx, clflushopt, clwb, sha, pku, shstk, gfni, vaes, vpclmulqdq, rdpid, sahf, lzcnt, sse4a, prfchw, mwaitx, xsavec, xsaves, clzero, wbnoinvd)]
$ julia +nightly -E 'Base.current_image_targets()'
Base.ImageTarget[znver4; flags=0; features_en=(sse3, pclmul, ssse3, fma, cx16, sse4.1, sse4.2, movbe, popcnt, aes, xsave, avx, f16c, fsgsbase, bmi, avx2, bmi2, avx512f, avx512dq, adx, avx512ifma, clflushopt, clwb, avx512cd, sha, avx512bw, avx512vl, avx512vbmi, pku, avx512vbmi2, shstk, gfni, vaes, vpclmulqdq, avx512vnni, avx512bitalg, avx512vpopcntdq, rdpid, sahf, lzcnt, sse4a, prfchw, mwaitx, xsavec, xsaves, clzero, wbnoinvd, avx512bf16)]
$ julia +nightly -e 'using InteractiveUtils; versioninfo()'
Julia Version 1.12.0-DEV.1421
Commit d36417b8230 (2024-10-17 17:37 UTC)
Build Info:
Official https://julialang.org release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 384 × AMD EPYC 9654 96-Core Processor
WORD_SIZE: 64
LLVM: libLLVM-18.1.7 (ORCJIT, znver4)
Threads: 1 default, 0 interactive, 1 GC (on 384 virtual cores)
That makes it sound like it might be capturing some values from the build machines and not correctly getting those from JULIA_CPU_TARGET (aka #54093)?
Although note that loading (specifically staticdata.c) is supposed to reject a pkgimage that requires more features than are present on the current machine, even if loading.jl makes a mistake, to prevent issues like this. So there are multiple levels of errors and failures here.
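The kind of safeguard described here can be pictured as a subset test on feature sets. The sketch below is purely conceptual: can_load is a hypothetical helper, the feature lists are shortened for illustration, and none of this is the actual logic in staticdata.c.

```julia
# Hypothetical check: an image is acceptable only if every feature it
# requires is available on the host CPU.
can_load(required, host) = issubset(required, host)

host          = ["sse3", "avx", "avx2", "fma"]                         # node without AVX-512
actually_used = ["sse3", "avx", "avx2", "fma", "avx512f", "avx512vl"]  # what the compiled code needs
recorded      = ["sse3", "avx", "avx2", "fma"]                         # what the image header claims

println(can_load(actually_used, host))  # false: the check would reject the image
println(can_load(recorded, host))       # true: the check passes on incomplete metadata
```

This is one reading of why a check of this kind would not help in the situation reported in this thread: the recorded target has no avx512 features, so a comparison against the host has nothing to reject.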
> That makes it sound like it might be capturing some values from the build machines
That'd be znver2, not cascadelake, skylake-avx512, or znver4. And if you compare the features on 1.11.0(-rc4) in https://github.com/JuliaLang/julia/issues/56177#issuecomment-2419719690, https://github.com/JuliaLang/julia/issues/56177#issuecomment-2419797870 and https://github.com/JuliaLang/julia/issues/56177#issuecomment-2420589017 (I used two different clusters), the sets are all different: my skylake-avx512 has pku in addition to what @Vobarkun has, and my znver4 has mwaitx, clzero, rdpid, sha, shstk, gfni, sse4a, wbnoinvd, vpclmulqdq, vaes in addition to my skylake-avx512.
> not correctly getting those from JULIA_CPU_TARGET
The current setting of JULIA_CPU_TARGET on x86_64 is actually more restrictive than the sets we showed above:
4-element Vector{Base.ImageTarget}:
generic; flags=0; features_en=(cx16)
sandybridge; flags=0; features_en=(sse3, pclmul, ssse3, cx16, sse4.1, sse4.2, popcnt, xsave, avx, sahf)
haswell; flags=0; features_en=(sse3, pclmul, ssse3, fma, cx16, sse4.1, sse4.2, movbe, popcnt, xsave, avx, f16c, fsgsbase, bmi, avx2, bmi2, sahf, lzcnt)
x86-64-v4; flags=32; features_en=()
x86-64-v4 is actually empty; the largest set is haswell, which is smaller than all the sets we showed above.
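The four entries listed above correspond to the four semicolon-separated targets in a JULIA_CPU_TARGET string. A minimal sketch, assuming the string is the one used in the bisect script in this thread; it only splits the text into target names and options and does not reproduce the real parsing done by the Julia runtime.

```julia
cpu_target = "generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1);x86-64-v4,-rdrnd,base(1)"

# Targets are separated by ';', and each target is a name followed by ','-separated options
for spec in split(cpu_target, ';')
    parts = split(spec, ',')
    name, options = parts[1], parts[2:end]
    println(rpad(name, 12), " options: ", join(options, ", "))
end

names = [first(split(spec, ',')) for spec in split(cpu_target, ';')]
println(names)
```

The resulting names are generic, sandybridge, haswell and x86-64-v4, in the same order as the ImageTarget list above.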
Ok, with
git bisect reset
git bisect start
git bisect good v1.11.0-rc3
git bisect bad v1.11.0-rc4
git bisect run ./bisect.sh
and the following bisect.sh script
#!/bin/bash
export JULIA_CPU_TARGET="generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1);x86-64-v4,-rdrnd,base(1)"
make cleanall
make -j
./julia -E 'Base.current_image_targets()' | grep avx512f
I confirmed that #55729 is indeed the culprit on the v1.11 release branch:
50c1ea848579ddc99e3c3633b85669988c2c89f2 is the first bad commit
commit 50c1ea848579ddc99e3c3633b85669988c2c89f2
Author: Ian Butterworth <[email protected]>
Date: Wed Sep 11 11:50:05 2024 -0400
Precompile the `@time_imports` printing so it doesn't confuse reports (#55729)
Makes functions for the report printing that can be precompiled into the
sysimage.
(cherry picked from commit 255162c7197e973d0427cc11d1e0117cdd76a1bf)
base/loading.jl | 94 +++++++++++++++++++++++++-----------------
contrib/generate_precompile.jl | 9 ++++
2 files changed, 66 insertions(+), 37 deletions(-)
Note that setting JULIA_CPU_TARGET during the build is necessary to reproduce the bug: it doesn't trigger without that, nor when setting JULIA_PRECOMPILE=0.
Note also that I did this on an avx512 machine, which rules out the build machine's CPU properties being wrongly cached: the issue happens regardless of the build machine.
But this also doesn't reproduce on 255162c7197e973d0427cc11d1e0117cdd76a1bf, the merge commit of #55729 on master, so yeah, there are multiple levels of errors here.
I can reproduce the issue on 255162c7197e973d0427cc11d1e0117cdd76a1bf if I revert ad407a6d2198c999f8f7b48a85d190694e392eb5, the merge commit of #54471 on master, which wasn't backported to release-1.11. Sounds like backporting that PR should fix the issue.
#55729 perhaps unwisely added precompiling rand(2,2) * rand(2,2). Could that be the critical change in that PR?
That's indeed the issue! It's the call to rand that breaks this: replacing those rand calls with ones solves the issue for me on v1.11.0-rc4. But I have no clue why this is happening.
I guess this can be closed now that https://github.com/JuliaLang/julia/pull/56239 has been merged in the backports branch.