[CI] Intermittent failure in `abstractarray` on aarch64-linux-gnu
Error in testset abstractarray:
Error During Test at none:1
Got exception outside of a @test
ProcessExitedException(8)
Stacktrace:
[1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
@ Base ./task.jl:935
[2] wait()
@ Base ./task.jl:999
[3] wait(c::Base.GenericCondition{ReentrantLock}; first::Bool)
@ Base ./condition.jl:130
[4] wait
@ Base ./condition.jl:125 [inlined]
[5] take_buffered(c::Channel{Any})
@ Base ./channels.jl:477
[6] take!(c::Channel{Any})
@ Base ./channels.jl:471
[7] take!(::Distributed.RemoteValue)
@ Distributed /cache/build/default-armageddon-0/julialang/julia-master/julia-bdbee27ae7/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:726
[8] remotecall_fetch(::Function, ::Distributed.Worker, ::String, ::Vararg{String}; kwargs::@Kwargs{seed::UInt128})
@ Distributed /cache/build/default-armageddon-0/julialang/julia-master/julia-bdbee27ae7/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:461
[9] remotecall_fetch(::Function, ::Int64, ::String, ::Vararg{String}; kwargs::@Kwargs{seed::UInt128})
@ Distributed /cache/build/default-armageddon-0/julialang/julia-master/julia-bdbee27ae7/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:492
[10] (::var"#37#47"{Vector{Task}, var"#print_testworker_errored#43"{ReentrantLock, Int64, Int64}, var"#print_testworker_stats#41"{ReentrantLock, Int64, Int64, Int64, Int64, Int64, Int64}, Vector{Any}, Dict{String, DateTime}})()
@ Main /cache/build/default-armageddon-0/julialang/julia-master/julia-bdbee27ae7/share/julia/test/runtests.jl:258
This is the only error I've seen lately on this platform. It is intermittent, but quite frequent.
During the ci-dev call we debugged this a little bit. A related error
SharedArrays (1) | started at 2024-01-29T18:35:21.628
From worker 17:
From worker 17: [13757] signal 7 (2): Bus error
From worker 17: in expression starting at none:0
From worker 16:
From worker 16: [13756] signal 7 (2): Bus error
From worker 16: in expression starting at none:0
From worker 17: setindex! at ./array.jl:972 [inlined]
From worker 17: setindex! at ./subarray.jl:403 [inlined]
From worker 17: map! at ./abstractarray.jl:3289 [inlined]
From worker 17: #67 at /cache/build/default-armageddon-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/SharedArrays/src/SharedArrays.jl:548
can be reproduced on the build machine with
using SharedArrays
TR = Float64
dims = (10,)
SharedArray{TR,length(dims)}(dims; init = S -> (@show S.loc_subarr_1d[1]))
There appears to be a crash in segv_handler.
#ci-dev looked into this, and the following is a (more minimal) reproducer:
using SharedArrays
# Arbitrary
dims = (10,)
T = Float64
# Create shared memory, truncate to the right size
fd_mem = SharedArrays.shm_open("/foo", SharedArrays.JL_O_CREAT | SharedArrays.JL_O_RDWR, SharedArrays.S_IRUSR | SharedArrays.S_IWUSR)
s = SharedArrays.fdio(fd_mem, true)
rc = ccall(:jl_ftruncate, Cint, (Cint, Int64), fd_mem, prod(dims)*sizeof(T))
# Ensure that shared memory file exists
run(`ls -la /dev/shm/foo`)
# mmap it and attempt to dereference
A = SharedArrays.mmap(s, Array{T, length(dims)}, dims, zero(Int64); grow=false);
A[1] # <-- dies with SIGBUS
Confirmed that the equivalent C program fails in the same way:
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h> /* For O_* constants */
#include <sys/stat.h> /* For mode constants */
#include <sys/mman.h> /* For shared memory */
#include <unistd.h> /* For ftruncate */
#include <string.h> /* For strlen */
int main() {
    const char *name = "/my_shared_memory"; // Name of the shared memory object
    const char *message = "Hello, Shared Memory!"; // Message to be written
    int shm_fd; // File descriptor of the shared memory
    void *ptr; // Pointer to the start of the mapping
    char *cursor; // Write position inside the mapping
    // Create the shared memory object
    shm_fd = shm_open(name, O_CREAT | O_RDWR, 0666);
    if (shm_fd == -1) {
        perror("Error creating shared memory");
        return EXIT_FAILURE;
    }
    // Configure the size of the shared memory object
    ftruncate(shm_fd, 4096);
    // Memory map the shared memory object
    ptr = mmap(0, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, shm_fd, 0);
    if (ptr == MAP_FAILED) {
        perror("Error mapping shared memory");
        return EXIT_FAILURE;
    }
    // Write to the shared memory object, advancing a separate cursor so that
    // ptr keeps pointing at the start of the mapping
    cursor = ptr;
    sprintf(cursor, "%s", message);
    cursor += strlen(message);
    // Now, the memory contains "Hello, Shared Memory!"
    printf("Data written to shared memory: %s\n", cursor - strlen(message));
    // Unmap the shared memory (munmap needs the original, page-aligned address)
    munmap(ptr, 4096);
    // Close the shared memory object
    close(shm_fd);
    // Optionally, remove the shared memory object
    // shm_unlink(name);
    return EXIT_SUCCESS;
}
(Many thanks to ChatGPT)
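For anyone retrying this on another machine, here is a more defensive variant of the C reproducer. It is a sketch only, not something that was run during the ci-dev call. It checks the return value of `ftruncate`, confirms the object size with `fstat`, and calls `posix_fallocate` before mapping. On tmpfs, a full `/dev/shm` can also cause SIGBUS on first touch, and `posix_fallocate` should turn that case into an ordinary error, which helps to rule it out. The helper name `shm_probe` and its return codes are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>     /* O_* constants, posix_fallocate */
#include <stdio.h>
#include <string.h>    /* strerror */
#include <sys/mman.h>  /* shm_open, mmap */
#include <sys/stat.h>  /* fstat */
#include <unistd.h>    /* ftruncate, close */

/* Hypothetical helper (not from the thread): create a shared memory object of
 * `size` bytes, verify its size, map it, and touch the first byte.
 * Returns 0 on success, or a negative number naming the step that failed. */
int shm_probe(const char *name, size_t size)
{
    struct stat st;
    void *ptr = MAP_FAILED;
    int rc = 0;
    int err;

    assert(name != NULL && size > 0);

    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd == -1) { perror("shm_open"); return -1; }

    if (ftruncate(fd, (off_t)size) == -1) { perror("ftruncate"); rc = -2; goto out; }

    /* Make sure the object really has the size we asked for. */
    if (fstat(fd, &st) == -1) { perror("fstat"); rc = -3; goto out; }
    if (st.st_size != (off_t)size) {
        fprintf(stderr, "unexpected size: %lld\n", (long long)st.st_size);
        rc = -3;
        goto out;
    }

    /* Reserve the backing pages now, so that a full tmpfs shows up as an
     * error here rather than as SIGBUS on first access. */
    err = posix_fallocate(fd, 0, (off_t)size);
    if (err != 0) {
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
        rc = -4;
        goto out;
    }

    ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ptr == MAP_FAILED) { perror("mmap"); rc = -5; goto out; }

    /* On an affected machine, this store is where SIGBUS would be expected. */
    ((volatile char *)ptr)[0] = 1;
    if (((volatile char *)ptr)[0] != 1) rc = -6;

out:
    if (ptr != MAP_FAILED) munmap(ptr, size);
    close(fd);
    shm_unlink(name);
    return rc;
}
```

Calling `shm_probe("/foo_probe", 4096)` from a small `main` should return 0 on a healthy system. If every step succeeds and the process still dies with SIGBUS at the store, the failure is unlikely to be a sizing or free-space problem. With older glibc versions, linking may need `-lrt`.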
We assume that this is now a kernel or glibc bug (currently running v5.4 and v2.17, respectively), and we may need to upgrade our buildbot to a newer version. It is easier to try upgrading the kernel (ironically) as we supposedly support glibc v2.17+, so we should first try a newer kernel with the current rootfs images and see what happens.
For the record, I couldn't reproduce the crash with either the Julia or the C reproducer on any aarch64-linux-gnu system I have access to (the C example runs fine and prints Data written to shared memory: Hello, Shared Memory!). Those systems typically have glibc newer than 2.19, though, so this smells a bit like a glibc bug.
Confirmed that updating the kernel from v5.4 -> v5.15 has solved this, so presumably this was a kernel bug! Huzzah!
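To check which kernel and glibc a given worker is actually running, something like the following could be used. This is a minimal sketch that assumes a glibc-based system, since `gnu_get_libc_version` is glibc-specific. The helper names are hypothetical, and the 5.15 threshold is simply the kernel version that fixed the crash here, not a known lower bound for the fix.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/utsname.h>       /* uname */
#include <gnu/libc-version.h>  /* gnu_get_libc_version (glibc only) */

/* Parse the leading "major.minor" of a version string such as
 * "5.4.0-100-generic" or "2.17". Returns 1 on success, 0 otherwise. */
int parse_major_minor(const char *s, int *major, int *minor)
{
    assert(s != NULL && major != NULL && minor != NULL);
    return sscanf(s, "%d.%d", major, minor) == 2;
}

/* Returns 1 if (major, minor) >= (want_major, want_minor), else 0. */
int version_at_least(int major, int minor, int want_major, int want_minor)
{
    return major > want_major || (major == want_major && minor >= want_minor);
}

/* Print the running kernel release and glibc version, and say whether the
 * kernel is at least 5.15. Returns 0 on success, -1 if uname fails. */
int report_versions(void)
{
    struct utsname u;
    int kmaj, kmin;

    if (uname(&u) == -1) { perror("uname"); return -1; }

    printf("kernel: %s\n", u.release);
    printf("glibc:  %s\n", gnu_get_libc_version());

    if (parse_major_minor(u.release, &kmaj, &kmin))
        printf("kernel >= 5.15: %s\n",
               version_at_least(kmaj, kmin, 5, 15) ? "yes" : "no");
    return 0;
}
```

Calling `report_versions()` from `main` on the old worker would be expected to report a 5.4 kernel with glibc 2.17, and a 5.15 kernel after the upgrade.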
Our aarch64-linux worker has been updated, so this should be fixed now. Will leave open until tests confirm this.
Update: looks like we solved the SharedArrays problem, but the abstractarray test is still failing 😂
To quote myself from the meeting:
I don't care about Shared Arrays
;)
Looks like this was fixed by #54718!