Unknown error on Vulkan backend
What happened?
The runtime reports a device failure on an AMD 780M. It looks like some kind of memory issue, but unusually it is not an allocation that fails, but a deallocation.
EXEC @main
D:\a\iree\iree\c\runtime\src\iree\hal\drivers\vulkan\direct_command_queue.cc:114: UNKNOWN; VkResult=4294967283; while invoking native function hal.device.queue.dealloca; while calling import;
[ 2] native hal.device.queue.dealloca:0 -
[ 1] bytecode compiled_vae.main$async:27102 tmp.txt:251:3
[ 0] bytecode compiled_vae.main:62 tmp.txt:251:3; invoking function 'main'
The reproducer is 170MB, so I can't upload it. Ask me and I'll send it to anyone trying to reproduce it.
Steps to reproduce your issue
iree-compile tmp.txt --iree-vulkan-target-triple=rdna2-unknown-windows --iree-stream-resource-index-bits=64 --iree-hal-target-backends=vulkan-spirv -o tmp.vmfb
iree-run-module --device=vulkan --function=main --input='1x4x64x64xf16' --module=tmp.vmfb
What component(s) does this issue relate to?
Runtime
Version information
b4273a4bfc66ba6dd8f62f6483d74d42a7b936f1
Additional context
No response
FWIW, I tried to reproduce this on my machine (NVIDIA 2080 Ti GPU) both without --iree-vulkan-target-triple and with --iree-vulkan-target-triple=turing-unknown-windows. Both of those failed to compile, making this tricky to help with as long as the pipeline is this brittle.
With turing-unknown-windows:
λ D:\dev\projects\iree-build\tools\iree-compile.exe D:\dev\projects\iree-tmp\issue_17060.mlir --iree-vulkan-target-triple=turing-unknown-windows --iree-stream-resource-index-bits=64 --iree-hal-target-backends=vulkan-spirv --iree-hal-executable-debug-level=3 -o D:\dev\projects\iree-tmp\issue_17060.vmfb
failed to translate executables
failed to translate executables
failed to translate executables
<unknown>:0: error: operands must be in the order AOp, BOp, COp
<unknown>:0: note: see current operation: %78 = "gpu.subgroup_mma_compute"(%54, %70, %arg5) : (!gpu.mma_matrix<16x16xf32, "COp">, !gpu.mma_matrix<16x16xf32, "COp">, !gpu.mma_matrix<16x16xf32, "COp">) -> !gpu.mma_matrix<16x16xf32, "COp">
D:\dev\projects\iree-tmp\issue_17060.mlir:578:8: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"vulkan-spirv"
With no target triple (conservative default):
λ D:\dev\projects\iree-build\tools\iree-compile.exe D:\dev\projects\iree-tmp\issue_17060.mlir --iree-stream-resource-index-bits=64 --iree-hal-target-backends=vulkan-spirv --iree-hal-executable-debug-level=3 -o D:\dev\projects\iree-tmp\issue_17060.vmfb
failed to translate executables
failed to translate executables
failed to translate executables
failed to translate executables
failed to translate executables
failed to translate executables
D:\dev\projects\iree-tmp\issue_17060.mlir:1509:8: error: failed to legalize operation 'arith.fptosi' that was explicitly marked illegal
%577 = torch.prims.convert_element_type %576, %int4 : !torch.vtensor<[128],f32>, !torch.int -> !torch.vtensor<[128],si64>
^
... (other similar errors) ...
D:\dev\projects\iree-tmp\issue_17060.mlir:310:22: error: 'func.func' op uses 8388736 bytes of shared memory; exceeded the limit of 16384 bytes
%result0, %result1 = torch.aten.var_mean.correction %17, %18, %int0_18, %true : !torch.vtensor<[1,32,16,4096],f32>, !torch.list<int>, !torch.int, !torch.bool -> !torch.vtensor<[1,32,1,1],f32>, !torch.vtensor<[1,32,1,1],f32>
^
D:\dev\projects\iree-tmp\issue_17060.mlir:253:6: note: called from
%1 = call @decode_inp(%0) : (!torch.vtensor<[1,4,64,64],f16>) -> !torch.vtensor<[1,128,512,512],f32>
^
Does it need to be tuned for the 780M's shared memory size?
It's happening on 7900-series GPUs now too, so I don't think it's a hardware issue. Maybe it's a driver thing; I'm not sure how we could effectively test that, though.
The 780M is RDNA3. I need to set up a dev environment on my machine and I have various meetings, so I won't get to this until later today or tomorrow. In the meanwhile, can you try to compile with rdna3-unknown-unknown and run?
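For concreteness, a sketch of that rdna3 attempt, reusing the repro commands from the original report with only the target triple swapped (file names as above; untested, so treat this as an assumption about the intended invocation):
rem Recompile with the RDNA3 triple suggested above, then run the same input.
iree-compile tmp.txt --iree-vulkan-target-triple=rdna3-unknown-unknown --iree-stream-resource-index-bits=64 --iree-hal-target-backends=vulkan-spirv -o tmp.vmfb
iree-run-module --device=vulkan --function=main --input='1x4x64x64xf16' --module=tmp.vmfb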
Yep, I'm getting the same compile error that @ScottTodd got compiling for turing-unknown-windows.
Okay, I can finally repro the originally reported runtime issue. It is a driver timeout, likely because the iGPU is weak and we are also not generating code with WMMA ops due to the rdna2 triple. The corp machine with the 780M that I have right now is a pain to develop on: lots of restrictions, and I still cannot get a functioning toolchain (both MSVC and Clang broke for various reasons), so I need to build on another Windows machine and copy binaries over. I'd need to figure out a better setup to play with it. A few things to try out:
- @gpetters-amd can you try to increase the TDR timeout threshold on Windows following https://learn.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys (see the sketch after this list)? I cannot do it on my side because I cannot modify the registry. I just want to confirm the issue.
- Whether or not the above works, try to capture a Tracy profile to see in which kernel we are particularly slow.
- We need to fix the compilation issue for rdna3 so we generate faster code for the iGPU.
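For reference, a minimal sketch of how the TDR delay could be raised from an elevated command prompt, based on the registry keys documented at the link above (values are in seconds; 60 is an arbitrary choice, and a reboot is required afterwards):
rem Raise the GPU timeout-detection delay from its 2-second default (assumption: 60 seconds is enough for this workload).
reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDelay /t REG_DWORD /d 60 /f
rem Give the driver correspondingly longer to respond to the preemption request.
reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDdiDelay /t REG_DWORD /d 60 /f
Remember to restore the defaults (or delete the values) afterwards, since a long TDR delay makes the desktop appear to hang if the GPU genuinely faults.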
Pasting the problematic dispatch here. Repro with tools/iree-compile --compile-from=executable-configurations (a sketch of the full command follows the dump):
hal.executable public @main$async_dispatch_20 {
hal.executable.variant public @vulkan_spirv_fb target(<"vulkan-spirv", "vulkan-spirv-fb", {spirv.target_env = #spirv.target_env<#spirv.vce<v1.6, [Shader, Float64, Float16, Int64, Int16, Int8, StorageBuffer16BitAccess, StorageUniform16, StoragePushConstant16, StorageBuffer8BitAccess, UniformAndStorageBuffer8BitAccess, StoragePushConstant8, GroupNonUniform, GroupNonUniformVote, GroupNonUniformArithmetic, GroupNonUniformBallot, GroupNonUniformShuffle, GroupNonUniformShuffleRelative, GroupNonUniformClustered, GroupNonUniformQuad, PhysicalStorageBufferAddresses, VariablePointers, VariablePointersStorageBuffer, DotProduct, DotProductInputAll, DotProductInput4x8BitPacked, DotProductInput4x8Bit, CooperativeMatrixKHR], [SPV_KHR_16bit_storage, SPV_KHR_8bit_storage, SPV_KHR_integer_dot_product, SPV_KHR_storage_buffer_storage_class, SPV_KHR_physical_storage_buffer, SPV_KHR_variable_pointers, SPV_KHR_cooperative_matrix]>, api=Vulkan, AMD:DiscreteGPU, #spirv.resource_limits<max_compute_shared_memory_size = 65536, max_compute_workgroup_invocations = 1024, max_compute_workgroup_size = [1024, 1024, 1024], subgroup_size = 64, min_subgroup_size = 32, max_subgroup_size = 64, cooperative_matrix_properties_khr = [#spirv.coop_matrix_props_khr<m_size = 16, n_size = 16, k_size = 16, a_type = i8, b_type = i8, c_type = i32, result_type = i32, acc_sat = false, scope = <Subgroup>>, #spirv.coop_matrix_props_khr<m_size = 16, n_size = 16, k_size = 16, a_type = f16, b_type = f16, c_type = f16, result_type = f16, acc_sat = false, scope = <Subgroup>>, #spirv.coop_matrix_props_khr<m_size = 16, n_size = 16, k_size = 16, a_type = f16, b_type = f16, c_type = f32, result_type = f32, acc_sat = false, scope = <Subgroup>>]>>}>) {
hal.executable.export public @main$async_dispatch_20_matmul_transpose_b_4096x512x512_f16xf16xf32 ordinal(0) layout(#hal.pipeline.layout<push_constants = 0, sets = [<0, bindings = [<0, storage_buffer, ReadOnly>, <1, storage_buffer, ReadOnly>, <2, storage_buffer>]>]>) attributes {hal.interface.bindings = [#hal.interface.binding<0, 0>, #hal.interface.binding<0, 1>, #hal.interface.binding<0, 2>]} {
^bb0(%arg0: !hal.device):
%x, %y, %z = flow.dispatch.workgroup_count_from_slice
hal.return %x, %y, %z : index, index, index
}
builtin.module {
func.func @main$async_dispatch_20_matmul_transpose_b_4096x512x512_f16xf16xf32() attributes {translation_info = #iree_codegen.translation_info<SPIRVCooperativeMatrixVectorize workgroup_size = [64, 2, 1] subgroup_size = 32, {pipeline_depth = 1 : i64, store_stage = 0 : i64}>} {
%cst = arith.constant 0.000000e+00 : f32
%c128 = arith.constant 128 : index
%c86398720 = arith.constant 86398720 : index
%c86397696 = arith.constant 86397696 : index
%c16877632 = arith.constant 16877632 : index
%0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c128) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<4096x512xf16>>
%1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c86398720) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<512x512xf16>>
%2 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c86397696) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<512xf16>>
%3 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%c16877632) : !flow.dispatch.tensor<writeonly:tensor<4096x512xf16>>
%4 = flow.dispatch.tensor.load %0, offsets = [0, 0], sizes = [4096, 512], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<4096x512xf16>> -> tensor<4096x512xf16>
%5 = flow.dispatch.tensor.load %1, offsets = [0, 0], sizes = [512, 512], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<512x512xf16>> -> tensor<512x512xf16>
%6 = flow.dispatch.tensor.load %2, offsets = [0], sizes = [512], strides = [1] : !flow.dispatch.tensor<readonly:tensor<512xf16>> -> tensor<512xf16>
%7 = tensor.empty() : tensor<4096x512xf16>
%8 = tensor.empty() : tensor<4096x512xf32>
%9 = linalg.fill ins(%cst : f32) outs(%8 : tensor<4096x512xf32>) -> tensor<4096x512xf32>
%10 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"]} ins(%4, %5 : tensor<4096x512xf16>, tensor<512x512xf16>) outs(%9 : tensor<4096x512xf32>) attrs = {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 128], [32, 64], [0, 0, 32], [16, 16, 16]]>} {
^bb0(%in: f16, %in_0: f16, %out: f32):
%12 = arith.extf %in : f16 to f32
%13 = arith.extf %in_0 : f16 to f32
%14 = arith.mulf %12, %13 : f32
%15 = arith.addf %out, %14 : f32
linalg.yield %15 : f32
} -> tensor<4096x512xf32>
%11 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%10, %6 : tensor<4096x512xf32>, tensor<512xf16>) outs(%7 : tensor<4096x512xf16>) {
^bb0(%in: f32, %in_0: f16, %out: f16):
%12 = arith.truncf %in : f32 to f16
%13 = arith.addf %12, %in_0 : f16
linalg.yield %13 : f16
} -> tensor<4096x512xf16>
flow.dispatch.tensor.store %11, %3, offsets = [0, 0], sizes = [4096, 512], strides = [1, 1] : tensor<4096x512xf16> -> !flow.dispatch.tensor<writeonly:tensor<4096x512xf16>>
return
}
}
}
}
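For anyone reproducing from the dump above, the implied command would look roughly like the following, assuming the pasted hal.executable is saved to a hypothetical dispatch_20.mlir (exact flags may need adjusting):
rem Resume compilation from the already-configured executable in the pasted dump.
iree-compile dispatch_20.mlir --compile-from=executable-configurations -o dispatch_20.vmfb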
It seems the issue is that inferFragType does not see through arith.extf ops, so both matmul operands get classified as COp fragments, which is what produces the 'operands must be in the order AOp, BOp, COp' error above.
https://github.com/llvm/llvm-project/pull/91988 fixes the compilation issue so that this compiles for rdna3.
https://github.com/llvm/llvm-project/pull/91988 has landed. We just need an LLVM integrate to pull it in: https://github.com/iree-org/iree/pull/17380