`clip` behaviour with NaN values is different between GPU and CPU ONNX inference
Bug Report
Describe the bug
Exporting a torch model with a clip operation results in different behaviour for NaN values when running ONNX inference on CPU vs GPU.
System information
- OS Platform and Distribution (e.g. Linux Ubuntu 20.04): Ubuntu 22.04
- ONNX version (e.g. 1.13):
  ~> pip freeze | grep -i onnx
  onnx==1.15.0
  onnxruntime==1.17.1
  onnxruntime-gpu==1.17.1
  onnxruntime_extensions==0.10.1
- Python version: 3.10.12
Reproduction instructions
import torch
from torch import nn
import numpy as np
import onnxruntime as ort

# First row is all NaN to probe NaN handling
arr = np.random.randn(16, 10).astype(np.float32)
arr[0, :] = np.nan

def run_onnx_inference(sessions, inputs_by_session) -> "np.ndarray":
    ort_outputs = []
    for sess, inputs in zip(sessions, inputs_by_session, strict=True):
        ort_inputs = {k.name: arr for k, arr in zip(sess.get_inputs(), inputs, strict=True)}
        ort_outputs.append(np.hstack(sess.run(None, ort_inputs)))
    return np.hstack(ort_outputs)
class ONNXModel(nn.Module):
    def forward(self, x):
        # Per-feature (array) clip bounds; torch exports these as Max + Min nodes
        lower = torch.tensor([-10.0] * x.shape[1])
        upper = torch.tensor([10.0] * x.shape[1])
        x = x.clip(lower, upper)
        # Replace any remaining NaN values with 0
        x[torch.isnan(x)] = 0.0
        return x
export_model = ONNXModel()
filename = "test.onnx"
torch.onnx.export(
    export_model,
    torch.randn(16, 10),
    filename,
    verbose=False,
    export_params=True,
    opset_version=17,
    do_constant_folding=True,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": [0], "output": [0]},
)
onnx_gpu_model = ort.InferenceSession(filename, providers=["CUDAExecutionProvider"])
onnx_gpu = run_onnx_inference([onnx_gpu_model], [[arr]])
onnx_cpu_model = ort.InferenceSession(filename, providers=["CPUExecutionProvider"])
onnx_cpu = run_onnx_inference([onnx_cpu_model], [[arr]])
diff = onnx_cpu - onnx_gpu
The resulting `diff` is:
array([[10., 10., 10., 10., 10., 10., 10., 10., 10., 10.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]],
      dtype=float32)
It looks like in the CPU case, `clip(NaN, -value, value)` returns NaN, but in the GPU case it returns -value.
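As a side note (my own illustration, not part of the repro), NumPy exposes both NaN conventions side by side, which mirrors the two behaviours observed here:

import numpy as np

nan = np.float32(np.nan)
# maximum/minimum propagate NaN, like the CPU provider:
print(np.maximum(nan, np.float32(-10.0)))  # nan
# fmax/fmin return the non-NaN operand, like the GPU result:
print(np.fmax(nan, np.float32(-10.0)))     # -10.0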
Could you paste the difference too?
@justinchuby sure, I've added that now.
This issue is better suited to the onnxruntime repo.
It looks like an additional check for NaN is needed at https://github.com/microsoft/onnxruntime/blob/33578cc76efc19b50c9fc011215b2777de193cd1/onnxruntime/core/providers/cuda/math/clip_impl.cu#L14
cc @yuslepukhin to check if my understanding is correct.
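In Python terms, the guard being suggested would behave roughly like this (a sketch of the intended semantics only, not the actual CUDA kernel; `clamp_propagate_nan` is a hypothetical name):

import math

def clamp_propagate_nan(value: float, low: float, high: float) -> float:
    # Let NaN pass through unchanged. CUDA's fminf/fmaxf return the
    # non-NaN operand, so a clamp built on them silently replaces NaN
    # with one of the bounds unless NaN is checked explicitly.
    if math.isnan(value):
        return value
    return min(max(value, low), high)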
@BowenBao I think you're correct that this is an onnxruntime issue rather than an onnx one, but the problem appears to be in the Min and Max operator implementations rather than in Clip.
When the clip bounds are arrays, torch exports the clip to ONNX as a Max followed by a Min. I can reproduce the discrepancy with a simpler example that doesn't use torch and demonstrates the problem using only the Min operator:
import torch  # Not used, but initializing the CUDA execution provider fails without it...
import numpy as np
import onnx
from onnx.onnx_pb import TensorProto
import onnxruntime

input = onnx.helper.make_tensor_value_info("input", TensorProto.FLOAT, ["N", 10])
output = onnx.helper.make_tensor_value_info("output", TensorProto.FLOAT, ["N", 10])
min_const = onnx.helper.make_node(
    "Constant",
    inputs=[],
    outputs=["min_const"],
    value=onnx.numpy_helper.from_array(np.array([10.0] * 10, dtype=np.float32)))
min_node = onnx.helper.make_node(
    "Min",
    inputs=["input", "min_const"],
    outputs=["output"],
)
graph_def = onnx.helper.make_graph(
    nodes=[min_const, min_node],
    name="test-model",
    inputs=[input],
    outputs=[output])
opset_import = onnx.helper.make_opsetid("", 17)
model_def = onnx.helper.make_model(
    graph_def,
    opset_imports=[opset_import],
    producer_name="test")
onnx.checker.check_model(model_def, full_check=True)

model_path = 'test_min.onnx'
onnx.save(model_def, model_path)

input = np.random.randn(3, 10).astype(np.float32)
input[0, :] = np.nan

cpu_session = onnxruntime.InferenceSession(model_path, providers=["CPUExecutionProvider"])
output = cpu_session.run(["output"], {"input": input})
print("CPU session output:")
print(output)

gpu_session = onnxruntime.InferenceSession(model_path, providers=["CUDAExecutionProvider"])
output = gpu_session.run(["output"], {"input": input})
print("GPU session output:")
print(output)
This outputs something like:
CPU session output:
[array([[        nan,         nan,         nan,         nan,         nan,
                nan,         nan,         nan,         nan,         nan],
       [-0.73628867, -1.0645038 , -0.29687342, -0.06496124,  0.40141365,
        -0.36313328, -0.17520589,  0.08746424,  0.30066383, -1.3963577 ],
       [ 0.8791592 ,  0.08518761, -1.1299503 ,  0.12336332, -0.02993149,
         0.1656782 , -1.5760034 ,  0.14083968, -0.37705085,  2.0208693 ]],
      dtype=float32)]
GPU session output:
[array([[10.        , 10.        , 10.        , 10.        , 10.        ,
        10.        , 10.        , 10.        , 10.        , 10.        ],
       [-0.73628867, -1.0645038 , -0.29687342, -0.06496124,  0.40141365,
        -0.36313328, -0.17520589,  0.08746424,  0.30066383, -1.3963577 ],
       [ 0.8791592 ,  0.08518761, -1.1299503 ,  0.12336332, -0.02993149,
         0.1656782 , -1.5760034 ,  0.14083968, -0.37705085,  2.0208693 ]],
      dtype=float32)]
An equivalent example that uses the Max operator instead also shows the same problem.
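For reference, the Max variant only needs the two node definitions swapped (a sketch; substitute these into the script above):

# Hypothetical Max variant: clamp from below with a lower-bound constant.
max_const = onnx.helper.make_node(
    "Constant",
    inputs=[],
    outputs=["max_const"],
    value=onnx.numpy_helper.from_array(np.array([-10.0] * 10, dtype=np.float32)))
max_node = onnx.helper.make_node(
    "Max",
    inputs=["input", "max_const"],
    outputs=["output"],
)
# graph_def then uses nodes=[max_const, max_node]; the NaN row comes back
# as -10.0 on the CUDA provider and as NaN on the CPU provider.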
If I change the original reproduction script to use scalar bounds, the issue doesn't reproduce: the Clip operator is exported instead, and the CUDA implementation of Clip does seem to handle NaN values correctly:
class ONNXModel(nn.Module):
    def forward(self, x):
        lower = torch.scalar_tensor(-10.0)
        upper = torch.scalar_tensor(10.0)
        x = x.clip(lower, upper)
        x[torch.isnan(x)] = 0.0
        return x
In Netron, the original exported model shows the clip lowered to a Max node followed by a Min node, while the scalar-bounds version shows a single Clip node. (Screenshots were attached to the original issue.)
As a workaround, I mask the NaN values before clipping and assign them explicitly afterwards. Because the NaN positions are recorded before the clip runs, the result no longer depends on how the Min/Max implementations treat NaN, so CPU and GPU inference agree. The updated forward method:
def forward(self, x):
    # Record which entries are NaN before clipping, so the result does
    # not depend on how clip (Min/Max) handles NaN
    nan_mask = torch.isnan(x)
    # Per-feature lower and upper bounds for the clipping operation
    lower = torch.tensor([-10.0] * x.shape[1])
    upper = torch.tensor([10.0] * x.shape[1])
    # Clip the tensor to the specified bounds
    x = x.clip(lower, upper)
    # Explicitly set the previously-NaN entries to 0
    x = torch.where(nan_mask, torch.tensor(0, dtype=x.dtype, device=x.device), x)
    return x
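To sanity-check the workaround, one can re-run the original export and comparison with this forward method (a sketch reusing `arr`, `run_onnx_inference`, and the export arguments from the repro script; `MaskedClipModel` is my name for a module with the forward above):

torch.onnx.export(
    MaskedClipModel(),
    torch.randn(16, 10),
    "workaround.onnx",
    opset_version=17,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": [0], "output": [0]},
)
cpu_out = run_onnx_inference(
    [ort.InferenceSession("workaround.onnx", providers=["CPUExecutionProvider"])], [[arr]])
gpu_out = run_onnx_inference(
    [ort.InferenceSession("workaround.onnx", providers=["CUDAExecutionProvider"])], [[arr]])
# The NaN row should be 0.0 on both providers, so the outputs now match
assert np.array_equal(cpu_out, gpu_out)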