[GPU]: HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION on GPU while the same inference passes on CPU
Attachments: input.0.bin.txt, input.1.bin.txt, input.2.bin.txt
What happened?
For the attached IR, we are seeing the following error:
:0:rocdevice.cpp :3006: 1267514219452d us: Callback: Queue 0x749caff00000 aborting with error : HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address. code: 0x29
while the same model passes on CPU with functional correctness. Due to the weights, the file size exceeds 25 MB, so it is uploaded as a zip.
Steps to reproduce your issue
1. Download the zip file and unzip it with 'unzip model.torch_onnx.mlir.zip'.
2. Run the following commands to reproduce the issue on MI300:
iree-opt -pass-pipeline='builtin.module(func.func(convert-torch-onnx-to-torch))' model.torch_onnx.mlir -o model.torch.mlir
iree-opt -pass-pipeline='builtin.module(torch-lower-to-backend-contract,func.func(torch-scalarize-shapes),torch-shape-refinement-pipeline,torch-backend-to-linalg-on-tensors-backend-pipeline)' model.torch.mlir -o model.modified.mlir
iree-compile model.modified.mlir --iree-hal-target-backends=rocm --iree-hip-target=gfx942 -o compiled_model.vmfb
iree-run-module --module='compiled_model.vmfb' --device=hip --function='main_graph' --input='[email protected]' --input='[email protected]' --input='[email protected]' --output=@'output.0.bin' --output=@'output.1.bin'
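For reference, the passing CPU run was presumably the equivalent of the following (a sketch, since the exact CPU commands aren't given; the llvm-cpu target, local-task device, and the output file names are assumptions):
iree-compile model.modified.mlir --iree-hal-target-backends=llvm-cpu -o compiled_model_cpu.vmfb
iree-run-module --module='compiled_model_cpu.vmfb' --device=local-task --function='main_graph' --input='[email protected]' --input='[email protected]' --input='[email protected]' --output=@'output.0.cpu.bin' --output=@'output.1.cpu.bin'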
This is impacting 600+ models, so please treat this as high priority.
What component(s) does this issue relate to?
Runtime
Version information
No response
Additional context
No response
I was able to solve this error by removing the input sizes and only using the input file, i.e. using --input='@input.0.bin' instead of --input='[email protected]'. It seems like the GPU doesn't support the input sizes and the input file at the same time.
Is the output at least close to the CPU output in shape? I wonder if, without the input shape, it's not actually taking the input we expect (which may be dynamically shaped) and is just producing garbage. Also, input sizes are normally required for .bin files.
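One way to check (assuming standard iree-run-module behavior: with no --output=@ flags, each result is printed to stdout with its shape and dtype) is to rerun the no-size variant and compare the printed shapes against the CPU outputs:
iree-run-module --module='compiled_model.vmfb' --device=hip --function='main_graph' --input='@input.0.bin' --input='@input.1.bin' --input='@input.2.bin'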
OOC why are the repro steps using iree-opt?
iree-opt -pass-pipeline='builtin.module(func.func(convert-torch-onnx-to-torch))' model.torch_onnx.mlir -o model.torch.mlir
iree-opt -pass-pipeline='builtin.module(torch-lower-to-backend-contract,func.func(torch-scalarize-shapes),torch-shape-refinement-pipeline,torch-backend-to-linalg-on-tensors-backend-pipeline)' model.torch.mlir -o model.modified.mlir
That sort of manual pipeline specification is unsupported. For any user workflows, use iree-compile and let it handle which pipelines to run.
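For example, the whole flow above should reduce to a single invocation like this (a sketch; it assumes iree-compile's input conversion handles the torch-onnx dialect, either via auto-detection or an explicit --iree-input-type=onnx):
iree-compile model.torch_onnx.mlir --iree-hal-target-backends=rocm --iree-hip-target=gfx942 -o compiled_model.vmfb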
Ah I see. Not sure; I was encountering this error while benchmarking Llama on GPU last night, which does have some dynamically shaped inputs. But by removing the input shapes from iree-benchmark-module and only using numpy files as the inputs, I was able to run/benchmark without this error.
Numpy files contain shape information (metadata + buffer contents). Binary files do not (just buffer contents). If using numpy, you can (should?) omit the 1x128xi64. If using binary, you need it (otherwise the runtime doesn't know how to interpret the buffer)
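As a concrete sketch of the two forms (the input.0.npy file name is illustrative, not from the repro):
# .npy carries dtype/shape metadata, so the type spec can be omitted:
iree-run-module --module='compiled_model.vmfb' --device=hip --function='main_graph' --input=@input.0.npy
# .bin is just raw bytes, so the type spec must be supplied on the flag:
iree-run-module --module='compiled_model.vmfb' --device=hip --function='main_graph' --input=1x128xi64=@input.0.bin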
A few things here:
- When I return from %296, the same set of commands (with input sizes) works fine, and the generated output matches CPU. The error appears on GPU when returning from %297 instead (the IR below shows the passing variant that returns %296):
%294 = torch.operator "onnx.Mul"(%290, %293) : (!torch.vtensor<[?,2,64,?],f32>, !torch.vtensor<[1],f32>) -> !torch.vtensor<[?,2,64,?],f32>
%295 = torch.operator "onnx.MatMul"(%292, %294) : (!torch.vtensor<[?,2,?,64],f32>, !torch.vtensor<[?,2,64,?],f32>) -> !torch.vtensor<[?,2,?,?],f32>
%296 = torch.operator "onnx.Add"(%295, %100) : (!torch.vtensor<[?,2,?,?],f32>, !torch.vtensor<[?,?,?,?],f32>) -> !torch.vtensor<[?,2,?,?],f32>
%297 = torch.operator "onnx.Softmax"(%296) {torch.onnx.axis = -1 : si64} : (!torch.vtensor<[?,2,?,?],f32>) -> !torch.vtensor<[?,2,?,?],f32>
return %296: !torch.vtensor<[?,2,?,?],f32>
- When this set of commands, i.e. inputs with sizes (which are needed for .bin files), works on CPU, then we should get the same behavior on GPU as well.
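Since the failure bisects to the onnx.Softmax at %297, a minimal standalone reproducer might look like the sketch below (unverified; the function attributes of the original module are elided, and the dynamic shapes are kept to match the failing case):
module {
  func.func @main_graph(%arg0: !torch.vtensor<[?,2,?,?],f32>) -> !torch.vtensor<[?,2,?,?],f32> {
    // Same op and axis attribute as %297 in the full model.
    %0 = torch.operator "onnx.Softmax"(%arg0) {torch.onnx.axis = -1 : si64} : (!torch.vtensor<[?,2,?,?],f32>) -> !torch.vtensor<[?,2,?,?],f32>
    return %0 : !torch.vtensor<[?,2,?,?],f32>
  }
}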