Segfault during CUBLAS logging

Open maleadt opened this issue 3 years ago • 6 comments

Apparently there's still some issue with the logger: [image]

As reported by @femtomc, encountered on CUDA.jl 3.3.4 with JULIA_DEBUG=CUDA.

maleadt • Jul 22 '21 16:07

Device info:

ubuntu in mbecker in ~ on ☁️  (us-east-2)
❯ neofetch
            .-/+oossssoo+/-.               ubuntu@mbecker
        `:+ssssssssssssssssss+:`           --------------
      -+ssssssssssssssssssyyssss+-         OS: Ubuntu 20.04.2 LTS x86_64
    .ossssssssssssssssssdMMMNysssso.       Host: t3.xlarge
   /ssssssssssshdmmNNmmyNMMMMhssssss/      Kernel: 5.8.0-1038-aws
  +ssssssssshmydMMMMMMMNddddyssssssss+     Uptime: 10 days, 1 hour, 9 mins
 /sssssssshNMMMyhhyyyyhmNMMMNhssssssss/    Packages: 772 (dpkg), 6 (snap)
.ssssssssdMMMNhsssssssssshNMMMdssssssss.   Shell: zsh 5.8
+sssshhhyNMMNyssssssssssssyNMMMysssssss+   Terminal: /dev/pts/4
ossyNMMMNyMMhsssssssssssssshmmmhssssssso   CPU: Intel Xeon Platinum 8259CL (4) @ 2.499GHz
ossyNMMMNyMMhsssssssssssssshmmmhssssssso   GPU: 00:03.0 Amazon.com, Inc. Device 1111
+sssshhhyNMMNyssssssssssssyNMMMysssssss+   Memory: 1644MiB / 15827MiB
.ssssssssdMMMNhsssssssssshNMMMdssssssss.
 /sssssssshNMMMyhhyyyyhdNMMMNhssssssss/
  +sssssssssdmydMMMMMMMMddddyssssssss+
   /ssssssssssshdmNNNNmyNMMMMhssssss/
    .ossssssssssssssssssdMMMNysssso.
      -+sssssssssssssssssyyyssss+-
        `:+ssssssssssssssssss+:`
            .-/+oossssoo+/-.


ubuntu in mbecker in ~ on ☁️  (us-east-2)
❯ julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.6.1 (2021-04-23)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake-avx512)
Environment:
  JULIA_VERSION = 1.6.1

femtomc • Jul 22 '21 16:07

Haven't been able to reproduce; I let the cublas tests run for a couple of hours with JULIA_DEBUG=CUDA set...

maleadt • Jul 22 '21 19:07

@maleadt At the very least, that convinces me it might not be a CUDA issue -- but rather something in Distributed related to task handling.

I was also able to produce segfaults by trying to log information from the GPU on a task before moving it to the CPU with cpu.

femtomc • Jul 22 '21 21:07

Possibly better to change the title of the issue as I continue to investigate.

femtomc • Jul 22 '21 21:07

@femtomc pointed out this could be related to #1314 (I believe the issue did occur with CUDA in a sysimage).

ericphanson • Jan 11 '22 20:01

That's likely, as these callbacks also use @cfunction (ref https://github.com/JuliaLang/julia/issues/43748). On the other hand, the backtrace here points only to Julia code, so the crash is likely to have happened on a Julia thread.

maleadt • Jan 12 '22 06:01