CUDA.jl
Segfault during CUBLAS logging
Apparently there's still some issue with the logger:
As reported by @femtomc, encountered on CUDA.jl 3.3.4 with JULIA_DEBUG=CUDA.
Device info:
ubuntu in mbecker in ~ on ☁️ (us-east-2)
❯ neofetch ~ master
ubuntu@mbecker
--------------
OS: Ubuntu 20.04.2 LTS x86_64
Host: t3.xlarge
Kernel: 5.8.0-1038-aws
Uptime: 10 days, 1 hour, 9 mins
Packages: 772 (dpkg), 6 (snap)
Shell: zsh 5.8
Terminal: /dev/pts/4
CPU: Intel Xeon Platinum 8259CL (4) @ 2.499GHz
GPU: 00:03.0 Amazon.com, Inc. Device 1111
Memory: 1644MiB / 15827MiB
ubuntu in mbecker in ~ on ☁️ (us-east-2)
❯ julia ~ master
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.6.1 (2021-04-23)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |
julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake-avx512)
Environment:
  JULIA_VERSION = 1.6.1
Haven't been able to reproduce; I let the cublas tests run for a couple of hours with JULIA_DEBUG=CUDA set...
@maleadt At the very least, that convinces me it might not be a CUDA issue, but rather something in Distributed related to task handling.
I was also able to produce segfaults by trying to log information from the GPU on a task before moving it to the CPU with cpu.
It's possibly better to change the title of the issue as I continue to investigate.
@femtomc pointed out this could be related to #1314 (I believe the issue did occur with CUDA in a sysimage)
That's likely, as these callbacks also use @cfunction (ref https://github.com/JuliaLang/julia/issues/43748). On the other hand, the backtrace here points only to Julia code, so the crash is likely to have happened on a Julia thread.