
GC corruption on 1.10 during cusparse/reduce tests

Open maleadt opened this issue 2 years ago • 7 comments

We've been seeing this frequently on CI, e.g., https://buildkite.com/julialang/cuda-dot-jl/builds/4151#0189e93f-3649-485f-bb5b-1cd9b2b9713d. Snippet:

GC error (probable corruption)
Allocations: 1078722668 (Pool: 1077025051; Big: 1697617); GC: 326
<?#0x7fb5614741f0::<circular reference @-1>>

[1026959] signal (6.-1845589712): Aborted
in expression starting at none:0
gsignal at /usr/lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /usr/lib/x86_64-linux-gnu/libc.so.6 (unknown line)
gc_dump_queue_and_abort at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-10/src/gc.c:1840
gc_mark_outrefs at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-10/src/gc.c:2543 [inlined]
gc_mark_loop_serial_ at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-10/src/gc.c:2712
gc_mark_loop_serial at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-10/src/gc.c:2735
gc_mark_loop at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-10/src/gc.c:2848 [inlined]
_jl_gc_collect at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-10/src/gc.c:3174
ijl_gc_collect at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-10/src/gc.c:3472
gc at ./gcutils.jl:129 [inlined]
runtests at /var/lib/buildkite-agent/builds/gpuci-9/julialang/cuda-dot-jl/test/setup.jl:117
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-10/src/gf.c:2889 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-10/src/gf.c:3071
jl_apply at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-10/src/julia.h:1966 [inlined]
jl_f__call_latest at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-10/src/builtins.c:812
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-10/src/gf.c:2889 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-10/src/gf.c:3071
jl_apply at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-10/src/julia.h:1966 [inlined]
do_apply at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-10/src/builtins.c:768
#invokelatest#2 at ./essentials.jl:887
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-10/src/gf.c:2889 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-10/src/gf.c:3071
jl_apply at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-10/src/julia.h:1966 [inlined]
do_apply at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-10/src/builtins.c:768
invokelatest at ./essentials.jl:884
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-10/src/gf.c:2889 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-10/src/gf.c:3071
jl_apply at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-10/src/julia.h:1966 [inlined]
do_apply at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-10/src/builtins.c:768
#110 at /root/.cache/julia-buildkite-plugin/julia_installs/bin/linux/x86_64/1.10/julia-latest-linux-x86_64/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:285
run_work_thunk at /root/.cache/julia-buildkite-plugin/julia_installs/bin/linux/x86_64/1.10/julia-latest-linux-x86_64/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:70
#109 at /root/.cache/julia-buildkite-plugin/julia_installs/bin/linux/x86_64/1.10/julia-latest-linux-x86_64/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:285
unknown function (ip: 0x7fb63451a242)
_jl_invoke at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-10/src/gf.c:2889 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-10/src/gf.c:3071
jl_apply at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-10/src/julia.h:1966 [inlined]
start_task at /cache/build/default-amdci5-0/julialang/julia-release-1-dot-10/src/task.c:1238
Allocations: 1078722668 (Pool: 1077025051; Big: 1697617); GC: 326

The full error: https://gist.github.com/maleadt/606e271c7c4b996552dad7fe4f6a8c0e

I'm not sure how to debug this.

maleadt, Aug 12 '23 11:08

Is this a new bug?

gbaraldi, Aug 14 '23 13:08

cc: @d-netto

vchuravy, Aug 14 '23 15:08

> Is this a new bug?

I haven't seen it on <1.10, so it's new in that sense. But I've only recently started paying attention to 1.10 CI logs, so I'm not sure how recently it got introduced upstream.

maleadt, Aug 14 '23 20:08

I can reproduce this locally, when running a specific combination of tests in sequence:

jltest -- --jobs=1 'core/device/intrinsics/wmma' 'libraries/cusparse/broadcast' 'libraries/cusparse/interfaces' 'libraries/cusparse/linalg' 'libraries/cusparse/reduce'

(where jltest is just an alias that runs Pkg.test with --project, setting jlargs)
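For readers who want to try the same invocation, a helper along these lines could stand in for jltest. This is a minimal sketch, not the actual alias: the function body, the optional JULIA override variable, and the choice to forward only test arguments (ignoring any extra Julia flags) are all assumptions. It relies on the `test_args` keyword of `Pkg.test` and must be run from the package checkout so that `--project` picks up the package environment.

```shell
# Hypothetical stand-in for the `jltest` alias mentioned above (not the original).
# Usage: jltest -- --jobs=1 'libraries/cusparse/reduce'
jltest() {
    # Drop a leading `--` separator if present; everything after it
    # is treated as an argument for the test suite.
    if [ "${1:-}" = "--" ]; then
        shift
    fi
    # JULIA is an assumed override for selecting a specific Julia binary;
    # it defaults to whatever `julia` is on PATH.
    # The `--` after the -e expression stops Julia from parsing the
    # remaining arguments itself, so they end up in ARGS and are
    # forwarded to the test process via `test_args`.
    "${JULIA:-julia}" --project -e 'using Pkg; Pkg.test(; test_args=ARGS)' -- "$@"
}
```

With this defined, the command above becomes `jltest -- --jobs=1 'core/device/intrinsics/wmma' ...`, run from the root of the CUDA.jl checkout.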

Any suggestions on how to debug this, or on how to flush out the corruption earlier? Running with GC_VERIFY, maybe? If I can get it to fail earlier, I can try to reduce this to see if it isn't a case of CUDA.jl badly managing memory.

maleadt, Aug 21 '23 19:08

So this is usually a type tag that got messed up. GC_VERIFY might work, but I usually try to get this under rr and watch the corrupted object to see where and why it got corrupted. The typical result is that it didn't get marked when it should have, which means finding the parent and doing the same thing until you find the original corruption.

gbaraldi, Aug 21 '23 20:08

rr doesn't work with CUDA :/

vchuravy, Aug 21 '23 20:08

Also reproduces on master, and GC_DEBUG_ENV (which includes GC_VERIFY) doesn't help.

maleadt, Aug 22 '23 08:08