Guidance on randn hanging on MI250X

Open williamfgc opened this issue 2 years ago • 3 comments

The question is two-fold:

  1. What is the best way to generate random numbers (scalars, not arrays) in device kernel code?
  2. Calling AMDGPU.randn(eltype(u), 1) or randn(eltype(u), 1) (from Base) makes AMDGPU.jl hang on the Crusher MI250X with rocm/5.4.0: there is no error, and the job stalls until it times out after 3 to 4 minutes. Without those calls it runs fine, but the problem requires random numbers on the device.

Any guidance is appreciated, and thanks for the great work.

williamfgc, Feb 07 '23 14:02

Can you split the issue into the device-side RNG and the host-side hang? They are unrelated and likely need different work to fix.

vchuravy, Feb 07 '23 14:02

@vchuravy Will do, I changed the description and will open a separate issue. Thanks!

williamfgc, Feb 07 '23 15:02

@williamfgc what stacktrace do you get if you Ctrl-C a hanging randn call?

jpsamaroo, Feb 07 '23 17:02