iree [spirv] Inaccurate TF ConvBert result on Apple M GPUs

What happened?

When the TensorFlow reference output is compared with the SHARK result, the difference exceeds the allowed tolerance. The following error message is shown:

AssertionError: 
Not equal to tolerance rtol=0.01, atol=0.001

Mismatched elements: 501275 / 512000 (97.9%)
Max absolute difference: 4.017738
Max relative difference: 243874.06
x: array([[[ 1.43804 , -1.28011 ,  0.285097, ..., -2.139163, -0.606236,
        -1.118984],
       [-1.287601,  0.407412, -0.379824, ...,  0.158129,  1.626622,...
y: array([[[ 1.871550e+00, -1.336534e+00,  1.800059e-01, ...,
        -1.499211e+00, -8.562328e-01, -1.358510e-01],
       [-1.766060e+00,  6.850528e-01, -2.355140e-01, ...,...

Steps to reproduce your issue

The error can be reproduced using the following script:

from shark.shark_inference import SharkInference
from shark.shark_downloader import download_tf_model
import numpy as np

if __name__ == "__main__":
    model, func_name, inputs, golden_out = download_tf_model("dbmdz/convbert-base-turkish-cased")

    shark_module = SharkInference(
        model, func_name, device="vulkan", mlir_dialect="mhlo"
    )

    shark_module.compile()
    result = shark_module.forward(inputs)
    np.testing.assert_allclose(golden_out, result, rtol=1e-02, atol=1e-03)

What component(s) does this issue relate to?

No response

Version information

No response

Additional context

VulkanSDK needs to be installed on the system.
IREE built from source code with Vulkan flags enabled is also able to reproduce the error.
To run with the m1-moltenvk-macos target triple on Apple M2, make the following change in the SHARK source file: https://github.com/nod-ai/SHARK/blob/d556c0d6ef8f69b32bc3b2d28165345dd2faf403/shark/iree_utils/vulkan_utils.py#L23

Replace: if vulkan_device == "M1":
With: if vulkan_device == "M1" or vulkan_device == "M2":
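The effect of that change is that an M2 device is mapped to the same MoltenVK target triple as an M1. A minimal sketch of the idea, assuming a hypothetical helper named `vulkan_target_triple` (the real function in vulkan_utils.py may be named and structured differently; only the M1 check and the m1-moltenvk-macos triple come from this issue):

```python
from typing import Optional


def vulkan_target_triple(vulkan_device: str) -> Optional[str]:
    """Hypothetical helper for illustration; not SHARK's actual code."""
    # Treat M2 like M1, as proposed above: both use the M1 MoltenVK triple.
    if vulkan_device in ("M1", "M2"):
        return "m1-moltenvk-macos"
    # Other devices are out of scope for this sketch.
    return None
```

Using a membership test instead of chaining `or` keeps the check short if more Apple devices need the same triple later.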

Aug 01 '22 11:08 PhaneeshB

please attach a link to the .mlir and iree command line to execute / recreate it

Aug 01 '22 16:08 powderluv

+1. It would be much easier for me to look into the issue with an input mlir file. Also as @stellaraccident asked in the other issue, is this specific to M2? (I'd suspect not but need to double check.)

Aug 01 '22 18:08 antiagainst

please attach a link to the .mlir and iree command line to execute / recreate it

Command :

<PATH TO ..../iree-compile> - --iree-input-type=mhlo --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=vulkan --iree-llvm-embedded-linker-path=<PATH TO ..../iree-lld> --mlir-print-debuginfo --mlir-print-op-on-diagnostic=true  --iree-llvm-target-cpu-features=host --iree-mhlo-demote-i64-to-i32=false --iree-flow-demote-i64-to-i32 -iree-vulkan-target-triple=m1-moltenvk-macos --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64

Input MLIR: https://storage.googleapis.com/shark_tank/dbmdz_convbert-base-turkish-cased_tf/dbmdz_convbert-base-turkish-cased_tf.mlir

Aug 02 '22 22:08 PhaneeshB

@antiagainst We checked and found that, as suspected, this issue is also present on M1 with Vulkan.

Aug 02 '22 23:08 PhaneeshB

We are seeing the issue of different results with and without --iree-flow-trace-dispatch-tensors again:

local-task:

1x16x32000xf32=[[0.607319 -1.07316 0.898614 -0.267287 1.78744 -0.263523 1.01242 -0.2313 -2.19909 -2.82577 -2.44984 0.527114 -0.46196 0.275833 -1.16742 -0.420368 ...

vulkan (w/ tracing):

1x16x32000xf32=[[0.607278 -1.0732 0.898576 -0.267251 1.78746 -0.263542 1.01245 -0.231225 -2.19902 -2.82572 -2.44985 0.52708 -0.461983 0.275756 -1.1674 -0.420419 ...

vulkan (w/o tracing):

1x16x32000xf32=[[-0.698902 -1.57589 1.41733 0.851334 1.94573 -0.392987 1.02575 -0.61025 -3.48231 -3.52306 -1.55764 -0.271172 -0.189102 -0.334609 -0.209776 0.191701

With tracing enabled the result is correct and matches local-task. Last time the discrepancy went away, but I guess we just got lucky. We still need to root cause it properly.
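To put numbers on the three dumps above, here is a small sketch that compares the first 16 printed values of each against local-task, using the same tolerances as the reproduction script. The helper name `max_abs_diff` is an assumption for illustration, and only the values visible in the truncated output are used:

```python
import numpy as np

# First 16 values of each output as printed above (the rest is truncated).
local_task = np.array([
    0.607319, -1.07316, 0.898614, -0.267287, 1.78744, -0.263523, 1.01242,
    -0.2313, -2.19909, -2.82577, -2.44984, 0.527114, -0.46196, 0.275833,
    -1.16742, -0.420368,
])
vulkan_traced = np.array([
    0.607278, -1.0732, 0.898576, -0.267251, 1.78746, -0.263542, 1.01245,
    -0.231225, -2.19902, -2.82572, -2.44985, 0.52708, -0.461983, 0.275756,
    -1.1674, -0.420419,
])
vulkan_untraced = np.array([
    -0.698902, -1.57589, 1.41733, 0.851334, 1.94573, -0.392987, 1.02575,
    -0.61025, -3.48231, -3.52306, -1.55764, -0.271172, -0.189102, -0.334609,
    -0.209776, 0.191701,
])


def max_abs_diff(actual: np.ndarray, expected: np.ndarray) -> float:
    """Largest element-wise absolute difference between two arrays."""
    return float(np.max(np.abs(actual - expected)))


for name, values in [("traced", vulkan_traced), ("untraced", vulkan_untraced)]:
    close = np.allclose(values, local_task, rtol=1e-2, atol=1e-3)
    print(name, max_abs_diff(values, local_task), close)
```

On these values the traced run stays within the tolerance, while the untraced run is off by more than 1.0 on several elements.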

Aug 10 '22 00:08 antiagainst

You can try compiling with --iree-stream-partitioning-favor=debug which disables all concurrency and puts a barrier between each dispatch - that'd narrow down whether it was multiple dispatches stomping on each other or something host/device.
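A sketch of how that suggestion could be combined with the compile command posted earlier in this thread. The function name `debug_compile_command` and its parameters are assumptions for illustration; the flags are copied from the thread as written, and the tool paths are left for the caller to fill in:

```python
from typing import List


def debug_compile_command(iree_compile: str, iree_lld: str) -> List[str]:
    """Build the thread's iree-compile invocation plus the debug partitioning flag.

    Hypothetical helper: it only assembles the argument list and does not run it.
    """
    return [
        iree_compile,
        "-",  # kept from the original command
        "--iree-input-type=mhlo",
        "--iree-vm-bytecode-module-output-format=flatbuffer-binary",
        "--iree-hal-target-backends=vulkan",
        f"--iree-llvm-embedded-linker-path={iree_lld}",
        "--mlir-print-debuginfo",
        "--mlir-print-op-on-diagnostic=true",
        "--iree-llvm-target-cpu-features=host",
        "--iree-mhlo-demote-i64-to-i32=false",
        "--iree-flow-demote-i64-to-i32",
        "-iree-vulkan-target-triple=m1-moltenvk-macos",
        "--iree-stream-resource-index-bits=64",
        "--iree-vm-target-index-bits=64",
        # Added per the suggestion above: no concurrency, barrier between dispatches.
        "--iree-stream-partitioning-favor=debug",
    ]
```

If the output becomes correct with this flag, that points towards dispatches interfering with each other rather than a host/device problem.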

Aug 10 '22 00:08 benvanik

Closing this for now given this is Vulkan on MoltenVK -- we have native Metal support and that's the way forward.

Jun 24 '23 00:06 antiagainst
