iree
[spirv] Inaccurate TF ConvBert result on Apple M GPUs
What happened?
When comparing the results from TensorFlow with the results from SHARK, the difference exceeds the tolerance range. The following error message is shown:
AssertionError:
Not equal to tolerance rtol=0.01, atol=0.001
Mismatched elements: 501275 / 512000 (97.9%)
Max absolute difference: 4.017738
Max relative difference: 243874.06
x: array([[[ 1.43804 , -1.28011 , 0.285097, ..., -2.139163, -0.606236,
-1.118984],
[-1.287601, 0.407412, -0.379824, ..., 0.158129, 1.626622,...
y: array([[[ 1.871550e+00, -1.336534e+00, 1.800059e-01, ...,
-1.499211e+00, -8.562328e-01, -1.358510e-01],
[-1.766060e+00, 6.850528e-01, -2.355140e-01, ...,...
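For readers unfamiliar with how the numbers in this message arise, here is a minimal sketch of the check that `np.testing.assert_allclose` performs: an element counts as mismatched when `|actual - desired| > atol + rtol * |desired|`. The helper name `mismatch_summary` is hypothetical and not part of NumPy or SHARK; the sample values are the first three entries of `x` and `y` printed above.

```python
import numpy as np


def mismatch_summary(actual, desired, rtol=1e-2, atol=1e-3):
    """Hypothetical helper: recompute the statistics assert_allclose reports."""
    actual = np.asarray(actual, dtype=np.float64)
    desired = np.asarray(desired, dtype=np.float64)
    abs_diff = np.abs(actual - desired)
    # Same criterion assert_allclose uses to flag an element as mismatched.
    mismatched = abs_diff > atol + rtol * np.abs(desired)
    # Relative difference is undefined where desired == 0; silence the warning.
    with np.errstate(divide="ignore", invalid="ignore"):
        rel_diff = abs_diff / np.abs(desired)
    return {
        "mismatched": int(mismatched.sum()),
        "total": int(mismatched.size),
        "max_abs_diff": float(abs_diff.max()),
        "max_rel_diff": float(np.nanmax(rel_diff)),
    }


# First three values of x (golden_out) and y (result) from the message above.
x_head = [1.43804, -1.28011, 0.285097]
y_head = [1.871550, -1.336534, 0.1800059]
print(mismatch_summary(x_head, y_head))
```

Running the same helper over the full arrays should give a per-run summary without raising, which can be handy when comparing several backends.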
Steps to reproduce your issue
The error can be reproduced using the following script:
from shark.shark_inference import SharkInference
from shark.shark_downloader import download_tf_model
import numpy as np

if __name__ == "__main__":
    model, func_name, inputs, golden_out = download_tf_model("dbmdz/convbert-base-turkish-cased")
    shark_module = SharkInference(
        model, func_name, device="vulkan", mlir_dialect="mhlo"
    )
    shark_module.compile()
    result = shark_module.forward(inputs)
    np.testing.assert_allclose(golden_out, result, rtol=1e-02, atol=1e-03)
What component(s) does this issue relate to?
No response
Version information
No response
Additional context
- VulkanSDK needs to be installed on the system.
- IREE built from source code with Vulkan flags enabled is also able to reproduce the error.
- To execute with the m1-moltenvk-macos target triple on Apple M2, please make the following change in the SHARK source code file https://github.com/nod-ai/SHARK/blob/d556c0d6ef8f69b32bc3b2d28165345dd2faf403/shark/iree_utils/vulkan_utils.py#L23 :
  replace : if vulkan_device == "M1":
  with : if vulkan_device == "M1" or vulkan_device == "M2":
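As a rough illustration of that change, the sketch below selects the MoltenVK target triple for either device name. The function name `vulkan_target_triple_flag` and the `APPLE_SILICON_DEVICES` set are hypothetical; the real code in `vulkan_utils.py` may be structured differently, and only the `"M1"`/`"M2"` comparison and the flag string are taken from this issue.

```python
# Hypothetical sketch; not the actual SHARK implementation.
APPLE_SILICON_DEVICES = {"M1", "M2"}


def vulkan_target_triple_flag(vulkan_device):
    """Return the target-triple flag for a device name, or None if unknown."""
    # Equivalent to: vulkan_device == "M1" or vulkan_device == "M2"
    if vulkan_device in APPLE_SILICON_DEVICES:
        return "-iree-vulkan-target-triple=m1-moltenvk-macos"
    return None


print(vulkan_target_triple_flag("M2"))
```

A set membership test keeps the condition short if more Apple device names need the same triple later.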
please attach a link to the .mlir and iree command line to execute / recreate it
+1. It would be much easier for me to look into the issue with an input mlir file. Also as @stellaraccident asked in the other issue, is this specific to M2? (I'd suspect not but need to double check.)
please attach a link to the .mlir and iree command line to execute / recreate it
Command :
<PATH TO ..../iree-compile> - --iree-input-type=mhlo --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=vulkan --iree-llvm-embedded-linker-path=<PATH TO ..../iree-lld> --mlir-print-debuginfo --mlir-print-op-on-diagnostic=true --iree-llvm-target-cpu-features=host --iree-mhlo-demote-i64-to-i32=false --iree-flow-demote-i64-to-i32 -iree-vulkan-target-triple=m1-moltenvk-macos --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64
Input MLIR: https://storage.googleapis.com/shark_tank/dbmdz_convbert-base-turkish-cased_tf/dbmdz_convbert-base-turkish-cased_tf.mlir
@antiagainst We checked and found that this issue is also present on M1 with Vulkan, as suspected.
We are again seeing the issue of different results with and without --iree-flow-trace-dispatch-tensors:
local-task:
1x16x32000xf32=[[0.607319 -1.07316 0.898614 -0.267287 1.78744 -0.263523 1.01242 -0.2313 -2.19909 -2.82577 -2.44984 0.527114 -0.46196 0.275833 -1.16742 -0.420368 ...
vulkan (w/ tracing):
1x16x32000xf32=[[0.607278 -1.0732 0.898576 -0.267251 1.78746 -0.263542 1.01245 -0.231225 -2.19902 -2.82572 -2.44985 0.52708 -0.461983 0.275756 -1.1674 -0.420419 ...
vulkan (w/o tracing):
1x16x32000xf32=[[-0.698902 -1.57589 1.41733 0.851334 1.94573 -0.392987 1.02575 -0.61025 -3.48231 -3.52306 -1.55764 -0.271172 -0.189102 -0.334609 -0.209776 0.191701
With tracing enabled the result is correct. Last time the issue went away, but I guess we just got lucky. We still need to root cause it properly.
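To make the comparison concrete, here is a small sketch that checks the first 16 values printed above against the local-task output, using the same tolerances as the reproduction script (rtol=1e-2, atol=1e-3). The variable names are illustrative; the numbers are copied from the dumps in this comment.

```python
import numpy as np

# First 16 values of each 1x16x32000xf32 dump, copied from the comment above.
local_task = np.array([
    0.607319, -1.07316, 0.898614, -0.267287, 1.78744, -0.263523, 1.01242,
    -0.2313, -2.19909, -2.82577, -2.44984, 0.527114, -0.46196, 0.275833,
    -1.16742, -0.420368,
])
vulkan_traced = np.array([
    0.607278, -1.0732, 0.898576, -0.267251, 1.78746, -0.263542, 1.01245,
    -0.231225, -2.19902, -2.82572, -2.44985, 0.52708, -0.461983, 0.275756,
    -1.1674, -0.420419,
])
vulkan_untraced = np.array([
    -0.698902, -1.57589, 1.41733, 0.851334, 1.94573, -0.392987, 1.02575,
    -0.61025, -3.48231, -3.52306, -1.55764, -0.271172, -0.189102, -0.334609,
    -0.209776, 0.191701,
])

# Treat local-task as the reference and compare each Vulkan run against it.
traced_ok = np.allclose(vulkan_traced, local_task, rtol=1e-2, atol=1e-3)
untraced_ok = np.allclose(vulkan_untraced, local_task, rtol=1e-2, atol=1e-3)
print("vulkan with tracing matches local-task:", traced_ok)
print("vulkan without tracing matches local-task:", untraced_ok)
```

On this sample, the traced run agrees with local-task within tolerance while the untraced run does not, which is consistent with the observation above.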
You can try compiling with --iree-stream-partitioning-favor=debug, which disables all concurrency and puts a barrier between each dispatch. That would narrow down whether it is multiple dispatches stomping on each other or something host/device.
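One way to run that experiment is to build the same compile command in three variants: baseline, with dispatch tracing, and with serialized dispatches. The sketch below only assembles the argument lists. `BASE_FLAGS` is an abbreviated subset of the command posted earlier in this thread, and the tool path, file names, and helper name `compile_command` are hypothetical placeholders.

```python
# Abbreviated subset of the flags from the command posted in this thread.
BASE_FLAGS = [
    "--iree-input-type=mhlo",
    "--iree-hal-target-backends=vulkan",
    "-iree-vulkan-target-triple=m1-moltenvk-macos",
]

# Extra flags per experiment; both flags are the ones mentioned in this thread.
VARIANTS = {
    "baseline": [],
    "trace": ["--iree-flow-trace-dispatch-tensors"],
    "serialized": ["--iree-stream-partitioning-favor=debug"],
}


def compile_command(iree_compile, input_mlir, output_vmfb, variant):
    """Hypothetical helper: build the argv list for one compile variant."""
    return [iree_compile, input_mlir, *BASE_FLAGS, *VARIANTS[variant],
            "-o", output_vmfb]


# Placeholder paths; substitute the real iree-compile binary and MLIR file.
for name in VARIANTS:
    cmd = compile_command("iree-compile", "convbert.mlir", f"convbert_{name}.vmfb", name)
    print(" ".join(cmd))
```

If the serialized variant matches local-task while the baseline does not, that would point toward a synchronization problem between dispatches rather than a single miscompiled kernel.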
Closing this for now given this is Vulkan on MoltenVK -- we have native Metal support and that's the way forward.