llvm icon indicating copy to clipboard operation
llvm copied to clipboard

[SYCL][HIP] "Memory access fault by GPU" from GROMACS

Open al42and opened this issue 3 years ago • 17 comments

Describe the bug

Running a SYCL program on AMD GPU results in it crashing with the following error:

Memory access fault by GPU node-4 (Agent handle: 0x215c1f0) on address 0x7ff5f0d6d000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)

The same program runs fine with hipSYCL on the same machine. An identical error was also observed by a 3rd party on Hygon DCU (based on the same Vega 20).

To Reproduce

Unfortunately, I could not come up with a short reproducer. Tested with latest IntelLLVM, ROCm 5.0.2, AMD MI50.

  • git clone https://gitlab.com/gromacs/gromacs.git -b aa-hwe-release-2022-dpcpp-hip --depth=10 && cd gromacs
  • mkdir build && cd build && cmake .. -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DGMX_GPU=SYCL -DGMX_GPU_NB_CLUSTER_SIZE=8 -DSYCL_CXX_FLAGS_EXTRA='-fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx906'
  • make tests -j$(nproc)
  • SYCL_DEVICE_FILTER=hip:gpu ./bin/mdrun-pull-test
    • SYCL_DEVICE_FILTER=hip:gpu is needed to mask OpenCL devices (#5825).
  • See it crash.

Sometimes the program just hangs. However, running it under gdb reliably leads to the crash.

Output of SYCL_PI_TRACE=2 SYCL_DEVICE_FILTER=hip:gpu gdb --args ./bin/mdrun-pull-test:
....
---> piKernelSetArg(
        <unknown> : 0xe32bd0
        <unknown> : 29
        <unknown> : 8
        <unknown> : 0x795248
) --->  pi_result : PI_SUCCESS
        [out]void * : 0x795248

---> piKernelSetArg(
        <unknown> : 0xe32bd0
        <unknown> : 30
        <unknown> : 8
        <unknown> : 0x795250
) --->  pi_result : PI_SUCCESS
        [out]void * : 0x795250

---> piEnqueueKernelLaunch(
        <unknown> : 0xe4e030
        <unknown> : 0xe32bd0
        <unknown> : 3
        <unknown> : 0x7fffffff8a30
        <unknown> : 0x7fffffff8a00
        <unknown> : 0x7fffffff8a18
        <unknown> : 0
        pi_event * : 0[ nullptr ]
        pi_event * : 0x1087218[ 0 ... ]
) --->  pi_result : PI_SUCCESS
        [out]pi_event * : 0[ nullptr ]
        [out]pi_event * : 0x1087218[ 0xe407c0 ... ]

---> piEventRelease(
        pi_event : 0xfb0e10
) --->  pi_result : PI_SUCCESS

---> piextUSMEnqueueMemcpy(
        <unknown> : 0xe4e030
        <unknown> : 0
        <unknown> : 0xeec380
        <unknown> : 0x7ffff260f000
        <unknown> : 9216
        <unknown> : 0
        pi_event * : 0[ nullptr ]
        pi_event * : 0x1073508[ 0 ... ]
:0:rocdevice.cpp            :2603: 9162115875909 us: 2064379: [tid:0x7ffff3abf700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address. code: 0x29

Thread 2 "mdrun-pull-test" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff3abf700 (LWP 2064383)]
0x00007ffff5bbc03b in raise () from /lib/x86_64-linux-gnu/libc.so.6

(gdb) bt
#0  0x00007ffff5bbc03b in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffff5b9b859 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007ffff42298ec in ?? () from /opt/tcbsys/rocm/5.0.2/hip/lib/libamdhip64.so.5
#3  0x00007ffff3b6a78f in bool rocr::AMD::AqlQueue::DynamicScratchHandler<true>(long, void*) () from /opt/tcbsys/rocm/5.0.2/lib/libhsa-runtime64.so.1
#4  0x00007ffff3ba645b in rocr::core::Runtime::AsyncEventsLoop(void*) () from /opt/tcbsys/rocm/5.0.2/lib/libhsa-runtime64.so.1
#5  0x00007ffff3b49757 in rocr::os::ThreadTrampoline(void*) () from /opt/tcbsys/rocm/5.0.2/lib/libhsa-runtime64.so.1
#6  0x00007ffff6462609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00007ffff5c98163 in clone () from /lib/x86_64-linux-gnu/libc.so.6

(gdb) thread 1
[Switching to thread 1 (Thread 0x7ffff4ebb2c0 (LWP 2064379))]

(gdb) bt 24
#0  0x00007ffff3b927ec in rocr::core::InterruptSignal::WaitRelaxed(hsa_signal_condition_t, long, unsigned long, hsa_wait_state_t) () from /opt/tcbsys/rocm/5.0.2/lib/libhsa-runtime64.so.1
#1  0x00007ffff3b925ca in rocr::core::InterruptSignal::WaitAcquire(hsa_signal_condition_t, long, unsigned long, hsa_wait_state_t) () from /opt/tcbsys/rocm/5.0.2/lib/libhsa-runtime64.so.1
#2  0x00007ffff3b82fc9 in rocr::HSA::hsa_signal_wait_scacquire(hsa_signal_s, hsa_signal_condition_t, long, unsigned long, hsa_wait_state_t) () from /opt/tcbsys/rocm/5.0.2/lib/libhsa-runtime64.so.1
#3  0x00007ffff4232f81 in ?? () from /opt/tcbsys/rocm/5.0.2/hip/lib/libamdhip64.so.5
#4  0x00007ffff423e09b in ?? () from /opt/tcbsys/rocm/5.0.2/hip/lib/libamdhip64.so.5
#5  0x00007ffff423e7b8 in ?? () from /opt/tcbsys/rocm/5.0.2/hip/lib/libamdhip64.so.5
#6  0x00007ffff42424d7 in ?? () from /opt/tcbsys/rocm/5.0.2/hip/lib/libamdhip64.so.5
#7  0x00007ffff426e62e in ?? () from /opt/tcbsys/rocm/5.0.2/hip/lib/libamdhip64.so.5
#8  0x00007ffff426ffc0 in ?? () from /opt/tcbsys/rocm/5.0.2/hip/lib/libamdhip64.so.5
#9  0x00007ffff42701dd in ?? () from /opt/tcbsys/rocm/5.0.2/hip/lib/libamdhip64.so.5
#10 0x00007ffff423f157 in ?? () from /opt/tcbsys/rocm/5.0.2/hip/lib/libamdhip64.so.5
#11 0x00007ffff4212a06 in ?? () from /opt/tcbsys/rocm/5.0.2/hip/lib/libamdhip64.so.5
#12 0x00007ffff4108987 in ?? () from /opt/tcbsys/rocm/5.0.2/hip/lib/libamdhip64.so.5
#13 0x00007ffff413bd7a in hipMemcpyAsync () from /opt/tcbsys/rocm/5.0.2/hip/lib/libamdhip64.so.5
#14 0x00007ffff4ea4147 in hip_piextUSMEnqueueMemcpy () from /nethome/aland/modules/intel-llvm/20220527-a1b42aa6-rocm5.0/lib/libpi_hip.so
#15 0x00007ffff604c44f in cl::sycl::detail::MemoryManager::copy_usm(void const*, std::shared_ptr<cl::sycl::detail::queue_impl>, unsigned long, void*, std::vector<_pi_event*, std::allocator<_pi_event*> >, _pi_event**) () from /nethome/aland/modules/intel-llvm/20220527-a1b42aa6-rocm5.0/lib/libsycl.so.5
#16 0x00007ffff60bded4 in cl::sycl::detail::ExecCGCommand::enqueueImp() () from /nethome/aland/modules/intel-llvm/20220527-a1b42aa6-rocm5.0/lib/libsycl.so.5
#17 0x00007ffff60b2332 in cl::sycl::detail::Command::enqueue(cl::sycl::detail::EnqueueResultT&, cl::sycl::detail::BlockingT, std::vector<cl::sycl::detail::Command*, std::allocator<cl::sycl::detail::Command*> >&) () from /nethome/aland/modules/intel-llvm/20220527-a1b42aa6-rocm5.0/lib/libsycl.so.5
#18 0x00007ffff60c5a9e in cl::sycl::detail::Scheduler::addCG(std::unique_ptr<cl::sycl::detail::CG, std::default_delete<cl::sycl::detail::CG> >, std::shared_ptr<cl::sycl::detail::queue_impl>) () from /nethome/aland/modules/intel-llvm/20220527-a1b42aa6-rocm5.0/lib/libsycl.so.5
#19 0x00007ffff610e068 in cl::sycl::handler::finalize() () from /nethome/aland/modules/intel-llvm/20220527-a1b42aa6-rocm5.0/lib/libsycl.so.5
#20 0x00007ffff613364f in cl::sycl::detail::queue_impl::submit_impl(std::function<void (cl::sycl::handler&)> const&, std::shared_ptr<cl::sycl::detail::queue_impl> const&, std::shared_ptr<cl::sycl::detail::queue_impl> const&, std::shared_ptr<cl::sycl::detail::queue_impl> const&, cl::sycl::detail::code_location const&, std::function<void (bool, bool, cl::sycl::event&)> const*) () from /nethome/aland/modules/intel-llvm/20220527-a1b42aa6-rocm5.0/lib/libsycl.so.5
#21 0x00007ffff6133e10 in cl::sycl::queue::submit_impl(std::function<void (cl::sycl::handler&)>, cl::sycl::detail::code_location const&) () from /nethome/aland/modules/intel-llvm/20220527-a1b42aa6-rocm5.0/lib/libsycl.so.5
#22 0x00007ffff7504978 in cl::sycl::event cl::sycl::queue::submit<copyFromDeviceBuffer<gmx::BasicVector<float> >(gmx::BasicVector<float>*, DeviceBuffer<gmx::BasicVector<float> >*, unsigned long, unsigned long, DeviceStream const&, GpuApiCallBehavior, void**)::{lambda(cl::sycl::handler&)#1}>(copyFromDeviceBuffer<gmx::BasicVector<float> >(gmx::BasicVector<float>*, DeviceBuffer<gmx::BasicVector<float> >*, unsigned long, unsigned long, DeviceStream const&, GpuApiCallBehavior, void**)::{lambda(cl::sycl::handler&)#1}, cl::sycl::detail::code_location const&) () from /tmp/gromacs/build/intel/gromacs/build/bin/../lib/libgromacs.so.7
#23 0x00007ffff7504701 in void copyFromDeviceBuffer<gmx::BasicVector<float> >(gmx::BasicVector<float>*, DeviceBuffer<gmx::BasicVector<float> >*, unsigned long, unsigned long, DeviceStream const&, GpuApiCallBehavior, void**) () from /tmp/gromacs/build/intel/gromacs/build/bin/../lib/libgromacs.so.7
(More stack frames follow...)

Environment (please complete the following information):

  • OS: Ubuntu Linux, 5.4.0-99-generic
  • Target device and vendor: AMD MI50
  • DPC++ version: a1b42aa6037aba9b86d40d8c1c59c0dc2f941481
  • Dependencies version: ROCm 5.0.2

al42and avatar May 27 '22 19:05 al42and

Tagging @rolandschulz

al42and avatar May 27 '22 19:05 al42and

An identical error was also observed by a 3rd party on Hygon DCU (based on the same Vega 20). Is it SYCL or HIP error ?

zjin-lcf avatar May 27 '22 23:05 zjin-lcf

Is it SYCL or HIP error ?

Hard to say, but almost identical code runs fine with the same hardware with the same ROCm runtime when using hipSYCL.

al42and avatar May 29 '22 23:05 al42and

Does the buffer version work ?

#if GMX_SYCL_USE_USM
    ev = deviceStream.stream().submit([&](sycl::handler& cgh) {
        cgh.memcpy(hostBuffer, buffer->buffer_->ptr_ + startingOffset, numValues * sizeof(ValueType));
    });
#else
    sycl::buffer<ValueType>& syclBuffer = *buffer->buffer_;

    ev = deviceStream.stream().submit([&](sycl::handler& cgh) {
        const auto d_bufferAccessor = sycl::accessor<ValueType, 1, sycl::access_mode::read>{
            syclBuffer, cgh, sycl::range(numValues), sycl::id(startingOffset)
        };
        cgh.copy(d_bufferAccessor, hostBuffer);
    });
#endif
    if (transferKind == GpuApiCallBehavior::Sync)
    {
        ev.wait_and_throw();
    }

zjin-lcf avatar Jun 13 '22 02:06 zjin-lcf

Does the buffer version work ?

#if GMX_SYCL_USE_USM
    ev = deviceStream.stream().submit([&](sycl::handler& cgh) {
        cgh.memcpy(hostBuffer, buffer->buffer_->ptr_ + startingOffset, numValues * sizeof(ValueType));
    });
#else
    sycl::buffer<ValueType>& syclBuffer = *buffer->buffer_;

    ev = deviceStream.stream().submit([&](sycl::handler& cgh) {
        const auto d_bufferAccessor = sycl::accessor<ValueType, 1, sycl::access_mode::read>{
            syclBuffer, cgh, sycl::range(numValues), sycl::id(startingOffset)
        };
        cgh.copy(d_bufferAccessor, hostBuffer);
    });
#endif
    if (transferKind == GpuApiCallBehavior::Sync)
    {
        ev.wait_and_throw();
    }

I met the same problem on AMD GPU, and found the code ending here accidentally whether USM is turned on or not. Is there any reason cause it? Thanks!

hqhy avatar Jun 13 '22 10:06 hqhy

I tried to build gromac, but couldn't reproduce the error. Hopefully, others could.

clang++: https://github.com/intel/llvm/commit/9a9a7a4026a0fe9892a275fe70e1c8330af89792 (around 5/13) device: gfx908 rocm: 4.5.2

Last energy frame read 4 time    0.020         [       OK ] PullTest/PullIntegrationTest.WithinTolerances/3 (422 ms)
[----------] 4 tests from PullTest/PullIntegrationTest (1768 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (3129 ms total)
[  PASSED  ] 4 tests.

zjin-lcf avatar Jun 13 '22 12:06 zjin-lcf

I also can't reproduce the error.

My versions: clang++: b59cd43a7ce29e20868fa51ef70a23b6d99145b7 device: gfx90a (MI200) rocm: 5.1.3

@al42and Is it reasonable easy for you to check whether you get the error also for different device/rocm version?

rolandschulz avatar Jun 13 '22 17:06 rolandschulz

May I know your gromacs version? And I want to make sure that you are using SYCL instead of HipSYCL. Looking forward to your reply!

hqhy avatar Jun 14 '22 02:06 hqhy

GROMACS version is aa-hwe-release-2022-dpcpp-hip. DPC++ not hipSYCL. @hqhy which version of clang++, device, and rocm are you testing with?

rolandschulz avatar Jun 14 '22 04:06 rolandschulz

GROMACS version is aa-hwe-release-2022-dpcpp-hip. DPC++ not hipSYCL. @hqhy which version of clang++, device, and rocm are you testing with?

My version: clang++ : ca457d9622 device: gfx906 rocm: 3.10 When I ran with the command gmx mdrun -v -deffnm em, I got this error. @rolandschulz

hqhy avatar Jun 14 '22 04:06 hqhy

@hqhy please note that the earliest version of ROCm that was tested with DPC++ for hip was 4.2, so I'm not sure if it is related to the error here but using ROCm 3.10 might cause issues

npmiller avatar Jun 15 '22 09:06 npmiller

@hqhy please note that the earliest version of ROCm that was tested with DPC++ for hip was 4.2, so I'm not sure if it is related to the error here but using ROCm 3.10 might cause issues

@npmiller Thanks for your reply! I am not sure that ROCm version cause this issue. As @al42and mentioned, his ROCm verision is 5.0.2, but also met this issue. Maybe there are other reasons for problem? By the way, this is a runtime error, instead of appearing at compile time.

hqhy avatar Jun 15 '22 10:06 hqhy

Does the buffer version work ?

@zjin-lcf, no, same error.

Is it reasonable easy for you to check whether you get the error also for different device/rocm version?

@rolandschulz I'll see what I can do.

al42and avatar Jun 15 '22 11:06 al42and

I can reproduce the error on a gfx906 device.

zjin-lcf avatar Jun 16 '22 02:06 zjin-lcf

I can reproduce the error on a gfx906 device.

@zjin-lcf Are you able to reproduce this issue on gfx908(MI100) or other amd gpu?

Here are two related issues that might provide some help. ISSUE 1339 & ISSUE 1220

hqhy avatar Jun 27 '22 04:06 hqhy

I've been looking into this and I'm still trying to track down exactly what the issue is, but I found out that on gfx906 adding the -fno-unroll-loops flag to SYCL_CXX_FLAGS_EXTRA seems to fix the issue, I don't have access to a MI200 device to see if it fixes it there as well but it might.

It's obviously not ideal but this could be a workaround until we find a proper fix.

npmiller avatar Aug 02 '22 09:08 npmiller

Quid with latest ROCm 5.2.1?

keryell avatar Aug 04 '22 18:08 keryell

I wasn't able to test this on ROCm 5.2.1.

However I ended up tracking this all the way down to what I believe is a bug in the compiler during register allocation. And I have a patch which fixes the issue and with it I can successfully run the specific test on gfx906.

I've submitted it to upstream LLVM so it might take a little while to land in DPC++ but I'll update this ticket once it's in:

  • https://reviews.llvm.org/D131884

The LLVM patch roughly goes over what is going on, but essentially the register allocator was incorrectly moving an instruction used for address calculation from the loop header to the inside of the loop, which resulted in incorrect data being used during the address calculation, which lead to the memory access faults we're seeing.

And the reason it didn't fail on gfx908 is that on MI100 instead of moving the instruction inside of the loop it uses an ACC vector register to spill it, this doesn't happen on gfx906 as it doesn't have ACC registers, and on gfx90a ACC registers can be used as general purpose vector registers so they also aren't used for spilling.

npmiller avatar Aug 15 '22 12:08 npmiller

The patch has now been merged in upstream LLVM, so it should land in DPC++ in the next pulldown.

npmiller avatar Aug 17 '22 08:08 npmiller

The issue does not reproduce anymore with cc03176dc3c938aa9fef808d57471d540b69931f (which includes @npmiller's fix).

al42and avatar Sep 16 '22 15:09 al42and