MLIR test crashes because getVQVirtualAddress() is called when virtualQueue_ is nullptr

Open pcf000 opened this issue 1 year ago • 0 comments

The AMD MLIR group has a buildbot that tests the main-line MLIR repo with some AMD-specific options. One test has recently started failing after a change in the codegen. The problem is not with that change, but rather with the cleanup after the test's kernel has run. The backtrace of the crash is:

#1 0x7f53e6d63a01 in roc::Device::getRocMemory(amd::Memory*) const (/usr/home/pf/hipamd/build/lib/libamdhip64.so.5+0x53ea01)
#2 0x7f53e6da1a4b in roc::VirtualGPU::getVQVirtualAddress() (/usr/home/pf/hipamd/build/lib/libamdhip64.so.5+0x57ca4b)
#3 0x7f53e6daea68 in roc::VirtualGPU::submitKernelInternal(amd::NDRangeContainer const&, amd::Kernel const&, unsigned char const*, void*, unsigned int, amd::NDRangeKernelCommand*) (/usr/home/pf/hipamd/build/lib/libamdhip64.so.5+0x589a68)
#4 0x7f53e6daf591 in roc::VirtualGPU::submitKernel(amd::NDRangeKernelCommand&) (/usr/home/pf/hipamd/build/lib/libamdhip64.so.5+0x58a591)
#5 0x7f53e6d1d34d in amd::NDRangeKernelCommand::submit(device::VirtualDevice&) /home/pf/ROCclr/cmake/../platform/command.hpp:1059:63
#6 0x7f53e6d15995 in amd::Command::enqueue() /home/pf/ROCclr/platform/command.cpp:370:7
#7 0x7f53e6b84e2b in ihipModuleLaunchKernel(ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, unsigned long, unsigned int) /home/pf/hipamd/src/hip_module.cpp:394:12
#8 0x7f53e6b85bdf in hipModuleLaunchKernel /home/pf/hipamd/src/hip_module.cpp:423:3
#9 0x7f53e78ffd7f in mgpuLaunchKernel /usr/home/pf/llvm-project/mlir/lib/ExecutionEngine/RocmRuntimeWrappers.cpp:61:3
#10 0x7f53dded1089  (<unknown module>)
#11 0x7f53dded10dc  (<unknown module>)
#12 0x6a294a1 in compileAndExecute((anonymous namespace)::Options&, mlir::Operation*, llvm::StringRef, (anonymous namespace)::CompileAndExecuteConfig, void**, std::unique_ptr<llvm::TargetMachine, std::default_delete<llvm::TargetMachine> >) (/usr/home/pf/llvm-project/build/bin/mlir-cpu-runner+0x6a294a1)
#13 0x6a281c7 in compileAndExecuteVoidFunction((anonymous namespace)::Options&, mlir::Operation*, llvm::StringRef, (anonymous namespace)::CompileAndExecuteConfig, std::unique_ptr<llvm::TargetMachine, std::default_delete<llvm::TargetMachine> >) (/usr/home/pf/llvm-project/build/bin/mlir-cpu-runner+0x6a281c7)
#14 0x6a2487d in mlir::JitRunnerMain(int, char**, mlir::DialectRegistry const&, mlir::JitRunnerConfig) (/usr/home/pf/llvm-project/build/bin/mlir-cpu-runner+0x6a2487d)
#15 0x555a280 in main /usr/home/pf/llvm-project/mlir/tools/mlir-cpu-runner/mlir-cpu-runner.cpp:33:10
#16 0x7f53f6391082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)
#17 0x545e8ad in _start (/usr/home/pf/llvm-project/build/bin/mlir-cpu-runner+0x545e8ad)

Frames 17 through 9 are on the MLIR side: mlir-cpu-runner, which takes an MLIR file and runs it on the CPU, down to mgpuLaunchKernel in the ROCm runtime wrappers (frames 11 and 10 have no symbols and are presumably the JIT-compiled code). Then there are a couple of HIP frames (8 and 7), and then we're into ROCclr (via libamdhip64.so). We're in submitKernelInternal(), near the end, just after the printf buffer has been output. (The test happens to be just a printf, and I can see what it prints before the crash.) It calls runScheduler(), and one argument is getVQVirtualAddress():

    if (gpuKernel.dynamicParallelism()) {
      dispatchBarrierPacket(kBarrierPacketHeader, true);
      static_cast<KernelBlitManager&>(blitMgr()).runScheduler(
          getVQVirtualAddress(), schedulerParam_, schedulerQueue_, schedulerSignal_, schedulerThreads_);
    }

The problem is that virtualQueue_ has never been set. Its default value is nullptr. There are two setup paths that give it a value, but neither is taken by this test. The call to getRocMemory() under getVQVirtualAddress() (frame 1) tries to read through virtualQueue_, and crashes because it's nullptr.

I do not know whether this code should guard against nullptr, or if there is a guarantee that one of the setup paths will be taken. For testing, I've added "&& virtualQueue_" to the condition of the if statement quoted above, and with that change all the MLIR tests pass.
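
To make the workaround concrete, here is a minimal sketch of the guarded condition. It is not the real ROCclr code: MockMemory, MockVirtualGPU, finishSubmit() and schedulerRuns are hypothetical stand-ins invented for illustration, and only the names virtualQueue_, getVQVirtualAddress() and runScheduler() echo the snippet above. The point is just that the extra "&& virtualQueue_" keeps getVQVirtualAddress() from being reached when no setup path has run.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the memory object behind the virtual queue.
struct MockMemory {
  std::uint64_t deviceAddress;
};

// Hypothetical stand-in for roc::VirtualGPU; not the real class layout.
struct MockVirtualGPU {
  MockMemory* virtualQueue_ = nullptr;  // stays null unless a setup path runs
  int schedulerRuns = 0;                // counts simulated runScheduler() calls

  // Mirrors the failing pattern: reads through virtualQueue_ unconditionally,
  // so the caller must make sure the pointer is set.
  std::uint64_t getVQVirtualAddress() const {
    assert(virtualQueue_ != nullptr && "virtualQueue_ was never set up");
    return virtualQueue_->deviceAddress;
  }

  // Simulated scheduler launch; only records that it was called.
  void runScheduler(std::uint64_t /*vqAddress*/) { ++schedulerRuns; }

  // Simulated tail of submitKernelInternal() with the extra null check.
  // Returns true if the scheduler was run.
  bool finishSubmit(bool dynamicParallelism) {
    if (dynamicParallelism && virtualQueue_) {
      runScheduler(getVQVirtualAddress());
      return true;
    }
    return false;
  }
};
```

With virtualQueue_ left at nullptr, finishSubmit(true) skips the scheduler instead of dereferencing the pointer. Whether silently skipping is the right behavior in ROCclr, or whether one of the setup paths should always have run by this point, is exactly the open question above.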

To reproduce, make sure that ROCm is installed (I was using 5.4.2), then compile and test as directed in https://mlir.llvm.org/getting_started/, with the additional cmake options "-DMLIR_INCLUDE_INTEGRATION_TESTS=ON -DMLIR_ENABLE_ROCM_RUNNER=ON -DMLIR_ENABLE_ROCM_CONVERSIONS=ON". I've been using Ubuntu 20.04, specifically a docker image based on the rocm/mlir:rocm5.4-latest image.

pcf000, Apr 25 '23 19:04