
[CI] Intermittent failure in `abstractarray` on aarch64-linux-gnu

Open giordano opened this issue 1 year ago • 7 comments

Example:

Error in testset abstractarray:
Error During Test at none:1
  Got exception outside of a @test
  ProcessExitedException(8)
  Stacktrace:
    [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
      @ Base ./task.jl:935
    [2] wait()
      @ Base ./task.jl:999
    [3] wait(c::Base.GenericCondition{ReentrantLock}; first::Bool)
      @ Base ./condition.jl:130
    [4] wait
      @ Base ./condition.jl:125 [inlined]
    [5] take_buffered(c::Channel{Any})
      @ Base ./channels.jl:477
    [6] take!(c::Channel{Any})
      @ Base ./channels.jl:471
    [7] take!(::Distributed.RemoteValue)
      @ Distributed /cache/build/default-armageddon-0/julialang/julia-master/julia-bdbee27ae7/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:726
    [8] remotecall_fetch(::Function, ::Distributed.Worker, ::String, ::Vararg{String}; kwargs::@Kwargs{seed::UInt128})
      @ Distributed /cache/build/default-armageddon-0/julialang/julia-master/julia-bdbee27ae7/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:461
    [9] remotecall_fetch(::Function, ::Int64, ::String, ::Vararg{String}; kwargs::@Kwargs{seed::UInt128})
      @ Distributed /cache/build/default-armageddon-0/julialang/julia-master/julia-bdbee27ae7/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:492
   [10] (::var"#37#47"{Vector{Task}, var"#print_testworker_errored#43"{ReentrantLock, Int64, Int64}, var"#print_testworker_stats#41"{ReentrantLock, Int64, Int64, Int64, Int64, Int64, Int64}, Vector{Any}, Dict{String, DateTime}})()
      @ Main /cache/build/default-armageddon-0/julialang/julia-master/julia-bdbee27ae7/share/julia/test/runtests.jl:258

This is the only error I've seen lately on this platform. It is intermittent, but quite frequent.

giordano avatar Dec 06 '23 23:12 giordano

During the ci-dev call we debugged this a little bit. A related error

SharedArrays                                      (1) |        started at 2024-01-29T18:35:21.628
      From worker 17:
      From worker 17:	[13757] signal 7 (2): Bus error
      From worker 17:	in expression starting at none:0
      From worker 16:
      From worker 16:	[13756] signal 7 (2): Bus error
      From worker 16:	in expression starting at none:0
      From worker 17:	setindex! at ./array.jl:972 [inlined]
      From worker 17:	setindex! at ./subarray.jl:403 [inlined]
      From worker 17:	map! at ./abstractarray.jl:3289 [inlined]
      From worker 17:	#67 at /cache/build/default-armageddon-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/SharedArrays/src/SharedArrays.jl:548

can be reproduced on the build machine with

using SharedArrays
TR = Float64
dims = (10,)
SharedArray{TR,length(dims)}(dims; init = S -> (@show S.loc_subarr_1d[1]))

There appears to be a crash in segv_handler.

giordano avatar Jan 29 '24 19:01 giordano

#ci-dev looked into this, and the following is a (more minimal) reproducer:

using SharedArrays

# Arbitrary
dims = (10,)
T = Float64

# Create shared memory, truncate to the right size
fd_mem = SharedArrays.shm_open("/foo", SharedArrays.JL_O_CREAT | SharedArrays.JL_O_RDWR, SharedArrays.S_IRUSR | SharedArrays.S_IWUSR)
s = SharedArrays.fdio(fd_mem, true)
rc = ccall(:jl_ftruncate, Cint, (Cint, Int64), fd_mem, prod(dims)*sizeof(T))

# Ensure that shared memory file exists
run(`ls -la /dev/shm/foo`)

# mmap it and attempt to dereference
A = SharedArrays.mmap(s, Array{T, length(dims)}, dims, zero(Int64); grow=false);
A[1] # <-- dies with SIGBUS

staticfloat avatar Jan 29 '24 19:01 staticfloat

Confirmed that the equivalent C program fails in the same way:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>           /* For O_* constants */
#include <sys/stat.h>        /* For mode constants */
#include <sys/mman.h>        /* For shared memory */
#include <unistd.h>          /* For ftruncate */
#include <string.h>          /* For strlen */

int main() {
    const char *name = "/my_shared_memory"; // Name of the shared memory object
    const char *message = "Hello, Shared Memory!"; // Message to be written
    int shm_fd;     // File descriptor of the shared memory
    void *ptr;      // Pointer to the shared memory

    // Create the shared memory object
    shm_fd = shm_open(name, O_CREAT | O_RDWR, 0666);
    if (shm_fd == -1) {
        perror("Error creating shared memory");
        return EXIT_FAILURE;
    }

    // Configure the size of the shared memory object
    if (ftruncate(shm_fd, 4096) == -1) {
        perror("Error sizing shared memory");
        return EXIT_FAILURE;
    }

    // Memory map the shared memory object
    ptr = mmap(0, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, shm_fd, 0);
    if (ptr == MAP_FAILED) {
        perror("Error mapping shared memory");
        return EXIT_FAILURE;
    }

    // Write to the shared memory object
    sprintf((char *)ptr, "%s", message);

    // Now, the memory contains "Hello, Shared Memory!"
    printf("Data written to shared memory: %s\n", (char *)ptr);

    // Unmap the shared memory (munmap needs the original, page-aligned address)
    munmap(ptr, 4096);

    // Close the shared memory object
    close(shm_fd);

    // Optionally, remove the shared memory object
    // shm_unlink(name);

    return EXIT_SUCCESS;
}

(Many thanks to ChatGPT)
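
A SIGBUS on a mapping is also what POSIX specifies for touching pages past the end of the backing object, so it may be worth ruling out a short object before blaming the kernel. A minimal sketch of such a check follows; the helper name `shm_reported_size` and the object name passed to it are hypothetical, not something from the reproducers above.

```c
#include <fcntl.h>      /* O_* constants */
#include <sys/mman.h>   /* shm_open, shm_unlink */
#include <sys/stat.h>   /* fstat, struct stat */
#include <sys/types.h>  /* off_t */
#include <unistd.h>     /* ftruncate, close */

/* Hypothetical helper: creates (or opens) the shared memory object `name`,
 * sizes it to `size` bytes, and returns the size the kernel reports via
 * fstat. Returns -1 if any step fails. The object is unlinked afterwards. */
long shm_reported_size(const char *name, off_t size) {
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd == -1)
        return -1;

    struct stat st;
    long result = -1;
    if (ftruncate(fd, size) == 0 && fstat(fd, &st) == 0)
        result = (long)st.st_size;

    close(fd);
    shm_unlink(name);
    return result;
}
```

If a call such as `shm_reported_size("/size_check", 4096)` returns 4096 on the affected machine, the object is sized as expected and the fault is unlikely to come from a silently failed `ftruncate`. On older glibc this needs linking with `-lrt`.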

We assume that this is now a kernel or glibc bug (currently running v5.4 and v2.17, respectively), and we may need to upgrade our buildbot to a newer version. It is easier to try upgrading the kernel (ironically) as we supposedly support glibc v2.17+, so we should first try a newer kernel with the current rootfs images and see what happens.
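
To compare machines, it helps to record the kernel release and glibc version each one is actually running. A small sketch, assuming a glibc-based Linux system (`gnu_get_libc_version` is glibc-specific); the function name `describe_platform` is made up for illustration:

```c
#include <stdio.h>
#include <sys/utsname.h>       /* uname */
#include <gnu/libc-version.h>  /* gnu_get_libc_version (glibc only) */

/* Hypothetical helper: writes "kernel <release>, glibc <version>" into buf.
 * Returns 0 on success, -1 if uname fails or buf is too small. */
int describe_platform(char *buf, size_t len) {
    struct utsname u;
    if (uname(&u) != 0)
        return -1;

    int n = snprintf(buf, len, "kernel %s, glibc %s",
                     u.release, gnu_get_libc_version());
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Calling `describe_platform` with a buffer of a few hundred bytes from `main` and printing the result gives one line per machine that can be pasted next to a pass/fail result for the reproducers.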

staticfloat avatar Jan 29 '24 20:01 staticfloat

For the record, I couldn't reproduce the crash with either the Julia or the C reproducer on any aarch64-linux-gnu system I have access to (the C example runs fine and prints Data written to shared memory: Hello, Shared Memory!), but they typically have glibc newer than 2.19, so this smells a bit like a glibc bug.

giordano avatar Jan 29 '24 20:01 giordano

Confirmed that updating the kernel from v5.4 -> v5.15 has solved this, so presumably this was a kernel bug! Huzzah!

Our aarch64-linux worker has been updated, so this should be fixed now. Will leave open until tests confirm this.

staticfloat avatar Jan 30 '24 00:01 staticfloat

Update: looks like we solved the SharedArrays problem, but the abstractarray test is still failing 😂

staticfloat avatar Jan 31 '24 17:01 staticfloat

To quote myself from the meeting:

I don't care about Shared Arrays

;)

vchuravy avatar Jan 31 '24 20:01 vchuravy

Looks like this was fixed by #54718!

staticfloat avatar Jun 17 '24 21:06 staticfloat