`clip` behaviour with NaN values is different between GPU and CPU ONNX inference
Bug Report
Describe the bug
Exporting a torch model with a clip operation results in different behaviour for NaN values when running ONNX inference on CPU vs GPU.
System information
- OS Platform and Distribution (e.g. Linux Ubuntu 20.04): Ubuntu 22.04
- ONNX version (e.g. 1.13):
  ~> pip freeze | grep -i onnx
  onnx==1.15.0
  onnxruntime==1.17.1
  onnxruntime-gpu==1.17.1
  onnxruntime_extensions==0.10.1
- Python version: 3.10.12
Reproduction instructions
import torch
from torch import nn
import numpy as np
import onnxruntime as ort

# First row is all NaN to probe NaN handling
arr = np.random.randn(16, 10).astype(np.float32)
arr[0, :] = np.nan

def run_onnx_inference(sessions, inputs_by_session) -> "np.ndarray":
    ort_outputs = []
    for sess, inputs in zip(sessions, inputs_by_session, strict=True):
        ort_inputs = {k.name: arr for k, arr in zip(sess.get_inputs(), inputs, strict=True)}
        ort_outputs.append(np.hstack(sess.run(None, ort_inputs)))
    return np.hstack(ort_outputs)
class ONNXModel(nn.Module):
    def forward(self, x):
        # Per-feature (array) clip bounds; torch exports these as Max + Min nodes
        lower = torch.tensor([-10.0] * x.shape[1])
        upper = torch.tensor([10.0] * x.shape[1])
        x = x.clip(lower, upper)
        # Replace any remaining NaN values with 0
        x[torch.isnan(x)] = 0.0
        return x
export_model = ONNXModel()
filename = "test.onnx"
torch.onnx.export(
    export_model,
    torch.randn(16, 10),
    filename,
    verbose=False,
    export_params=True,
    opset_version=17,
    do_constant_folding=True,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": [0], "output": [0]},
)
onnx_gpu_model = ort.InferenceSession(filename, providers=["CUDAExecutionProvider"])
onnx_gpu = run_onnx_inference([onnx_gpu_model], [[arr]])
onnx_cpu_model = ort.InferenceSession(filename, providers=["CPUExecutionProvider"])
onnx_cpu = run_onnx_inference([onnx_cpu_model], [[arr]])
diff = onnx_cpu - onnx_gpu
The resulting `diff` is:
array([[10., 10., 10., 10., 10., 10., 10., 10., 10., 10.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]],
      dtype=float32)
It looks like in the CPU case, `clip(NaN, -value, value)` returns NaN, but in the GPU case it returns -value.
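As a side note (my own illustration, not part of the repro), NumPy exposes both NaN conventions side by side, which mirrors the two behaviours observed here:

import numpy as np

nan = np.float32(np.nan)
# maximum/minimum propagate NaN, like the CPU provider:
print(np.maximum(nan, np.float32(-10.0)))  # nan
# fmax/fmin return the non-NaN operand, like the GPU result:
print(np.fmax(nan, np.float32(-10.0)))     # -10.0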
Could you paste the difference too?
@justinchuby sure, I've added that now.
This issue is better suited to the onnxruntime repo.
It looks like an additional check for NaN is needed at https://github.com/microsoft/onnxruntime/blob/33578cc76efc19b50c9fc011215b2777de193cd1/onnxruntime/core/providers/cuda/math/clip_impl.cu#L14
cc @yuslepukhin to check if my understanding is correct.
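In Python terms, the guard being suggested would behave roughly like this (a sketch of the intended semantics only, not the actual CUDA kernel; `clamp_propagate_nan` is a hypothetical name):

import math

def clamp_propagate_nan(value: float, low: float, high: float) -> float:
    # Let NaN pass through unchanged. CUDA's fminf/fmaxf return the
    # non-NaN operand, so a clamp built on them silently replaces NaN
    # with one of the bounds unless NaN is checked explicitly.
    if math.isnan(value):
        return value
    return min(max(value, low), high)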
@BowenBao I think you're correct that this is an onnxruntime issue rather than an onnx one, but the problem appears to be in the Min and Max operator implementations rather than in Clip.
When the clip bounds are arrays, torch exports the clip to ONNX as a Max followed by a Min. I can reproduce the discrepancy with a simpler example that doesn't use torch and demonstrates the problem using only the Min operator:
import torch  # Not used, but initializing the CUDA execution provider fails without it...
import numpy as np
import onnx
from onnx.onnx_pb import TensorProto
import onnxruntime

input = onnx.helper.make_tensor_value_info("input", TensorProto.FLOAT, ["N", 10])
output = onnx.helper.make_tensor_value_info("output", TensorProto.FLOAT, ["N", 10])
min_const = onnx.helper.make_node(
    "Constant",
    inputs=[],
    outputs=["min_const"],
    value=onnx.numpy_helper.from_array(np.array([10.0] * 10, dtype=np.float32)))
min_node = onnx.helper.make_node(
    "Min",
    inputs=["input", "min_const"],
    outputs=["output"],
)
graph_def = onnx.helper.make_graph(
    nodes=[min_const, min_node],
    name="test-model",
    inputs=[input],
    outputs=[output])
opset_import = onnx.helper.make_opsetid("", 17)
model_def = onnx.helper.make_model(
    graph_def,
    opset_imports=[opset_import],
    producer_name="test")
onnx.checker.check_model(model_def, full_check=True)

model_path = 'test_min.onnx'
onnx.save(model_def, model_path)

input = np.random.randn(3, 10).astype(np.float32)
input[0, :] = np.nan

cpu_session = onnxruntime.InferenceSession(model_path, providers=["CPUExecutionProvider"])
output = cpu_session.run(["output"], {"input": input})
print("CPU session output:")
print(output)

gpu_session = onnxruntime.InferenceSession(model_path, providers=["CUDAExecutionProvider"])
output = gpu_session.run(["output"], {"input": input})
print("GPU session output:")
print(output)
This outputs something like:
CPU session output:
[array([[        nan,         nan,         nan,         nan,         nan,
                nan,         nan,         nan,         nan,         nan],
       [-0.73628867, -1.0645038 , -0.29687342, -0.06496124,  0.40141365,
        -0.36313328, -0.17520589,  0.08746424,  0.30066383, -1.3963577 ],
       [ 0.8791592 ,  0.08518761, -1.1299503 ,  0.12336332, -0.02993149,
         0.1656782 , -1.5760034 ,  0.14083968, -0.37705085,  2.0208693 ]],
      dtype=float32)]
GPU session output:
[array([[10.        , 10.        , 10.        , 10.        , 10.        ,
        10.        , 10.        , 10.        , 10.        , 10.        ],
       [-0.73628867, -1.0645038 , -0.29687342, -0.06496124,  0.40141365,
        -0.36313328, -0.17520589,  0.08746424,  0.30066383, -1.3963577 ],
       [ 0.8791592 ,  0.08518761, -1.1299503 ,  0.12336332, -0.02993149,
         0.1656782 , -1.5760034 ,  0.14083968, -0.37705085,  2.0208693 ]],
      dtype=float32)]
An equivalent example that uses the Max operator instead also shows the same problem.
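For reference, the Max variant only needs the two node definitions swapped (a sketch; substitute these into the script above):

# Hypothetical Max variant: clamp from below with a lower-bound constant.
max_const = onnx.helper.make_node(
    "Constant",
    inputs=[],
    outputs=["max_const"],
    value=onnx.numpy_helper.from_array(np.array([-10.0] * 10, dtype=np.float32)))
max_node = onnx.helper.make_node(
    "Max",
    inputs=["input", "max_const"],
    outputs=["output"],
)
# graph_def then uses nodes=[max_const, max_node]; the NaN row comes back
# as -10.0 on the CUDA provider and as NaN on the CPU provider.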
If I change the original reproduction script to use scalar bounds, the issue doesn't reproduce: the Clip operator is exported instead, and the CUDA implementation of Clip does seem to handle NaN values correctly:
class ONNXModel(nn.Module):
    def forward(self, x):
        lower = torch.scalar_tensor(-10.0)
        upper = torch.scalar_tensor(10.0)
        x = x.clip(lower, upper)
        x[torch.isnan(x)] = 0.0
        return x
In Netron, the original exported model shows the clip lowered to a Max node followed by a Min node, while the scalar-bounds version shows a single Clip node. (Screenshots were attached to the original issue.)
As a workaround, I mask the NaN values before clipping and assign them explicitly afterwards. Because the NaN positions are recorded before the clip runs, the result no longer depends on how the Min/Max implementations treat NaN, so CPU and GPU inference agree. The updated forward method:
def forward(self, x):
    # Record which entries are NaN before clipping, so the result does
    # not depend on how clip (Min/Max) handles NaN
    nan_mask = torch.isnan(x)
    # Per-feature lower and upper bounds for the clipping operation
    lower = torch.tensor([-10.0] * x.shape[1])
    upper = torch.tensor([10.0] * x.shape[1])
    # Clip the tensor to the specified bounds
    x = x.clip(lower, upper)
    # Explicitly set the previously-NaN entries to 0
    x = torch.where(nan_mask, torch.tensor(0, dtype=x.dtype, device=x.device), x)
    return x
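To sanity-check the workaround, one can re-run the original export and comparison with this forward method (a sketch reusing `arr`, `run_onnx_inference`, and the export arguments from the repro script; `MaskedClipModel` is my name for a module with the forward above):

torch.onnx.export(
    MaskedClipModel(),
    torch.randn(16, 10),
    "workaround.onnx",
    opset_version=17,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": [0], "output": [0]},
)
cpu_out = run_onnx_inference(
    [ort.InferenceSession("workaround.onnx", providers=["CPUExecutionProvider"])], [[arr]])
gpu_out = run_onnx_inference(
    [ort.InferenceSession("workaround.onnx", providers=["CUDAExecutionProvider"])], [[arr]])
# The NaN row should be 0.0 on both providers, so the outputs now match
assert np.array_equal(cpu_out, gpu_out)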