
[Performance] nearest neighbor Resize operator is significantly slower than pytorch for 3D tensors

Open SimonRelu opened this issue 2 years ago • 22 comments

Describe the issue

I have a model that I'd like to run with onnxruntime to decrease the install size. Here's a table comparing the total time of ONNX Runtime and PyTorch when running different numbers of inferences (time in seconds):

| Runs        | 1    | 5    | 10   | 50   | 100   |
|-------------|------|------|------|------|-------|
| pytorch     | 0.45 | 0.75 | 1.11 | 3.85 | 7.17  |
| onnxruntime | 0.26 | 0.79 | 1.51 | 6.98 | 13.87 |

To me it seems that while the first inference in PyTorch is slow, subsequent ones are much faster compared to onnxruntime. I believe this has to do with cuDNN plans not being cached. When I dug into the PyTorch code I found that the plan for convolution layers is being cached: in the run_single_conv function in Conv_v8.cpp, benchmark_cache.find(key) is executed, which searches the cache to see if a convolution plan with the same parameters has already been calculated. If not, it generates the plan and adds it to the cache. I was wondering if onnxruntime has something similar in place, since I couldn't find it.
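For illustration only, the caching pattern I mean looks roughly like this in Python; find_conv_plan and run_algo_search are made-up names, not actual PyTorch or onnxruntime APIs:

from functools import lru_cache

def run_algo_search(key):
    # stand-in for the expensive cuDNN algorithm benchmark
    return f"plan-for-{key}"

@lru_cache(maxsize=None)
def find_conv_plan(key):
    # key is a hashable description of the conv: shapes, dtype, strides, pads, ...
    return run_algo_search(key)  # runs once per unique key; later calls hit the cache

# the second call with the same configuration returns the cached plan
plan = find_conv_plan(((1, 1, 96, 96, 96), "float16", (1, 1, 1)))
plan = find_conv_plan(((1, 1, 96, 96, 96), "float16", (1, 1, 1)))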

To reproduce

I convert the model like this:

import torch

input_data = torch.randn((1, 1, 96, 96, 96)).cuda().half()
torch.onnx.export(
    model,  # model being run
    input_data,  # model input (or a tuple for multiple inputs)
    "model.onnx",  # where to save the model (can be a file or file-like object)
    export_params=True,  # store the trained parameter weights inside the model file
    opset_version=11,  # the ONNX version to export the model to
    do_constant_folding=True,  # whether to execute constant folding for optimization
    input_names=["input"],  # the model's input names
    output_names=["output"],  # the model's output names
    keep_initializers_as_inputs=False,
)
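As a quick sanity check of the export (a sketch assuming the onnx Python package is installed, not part of my original script), the graph can be validated and the Resize nodes listed:

import onnx

m = onnx.load("model.onnx")
onnx.checker.check_model(m)
# the nearest-neighbor upsamples export as Resize nodes
print([n.name for n in m.graph.node if n.op_type == "Resize"])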

And run inference as follows:

import onnxruntime
import numpy as np
import time

NUMBER: int = 1

opts = onnxruntime.SessionOptions()
# opts.enable_profiling = True
opts.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
# opts.intra_op_num_threads = 1

ort_session = onnxruntime.InferenceSession(
    "model.onnx",
    opts,
    providers=[
        (
            "CUDAExecutionProvider",
            {
                "device_id": 0,
                # "arena_extend_strategy": "kNextPowerOfTwo",
                # "do_copy_in_default_stream": False,
                "cudnn_conv_use_max_workspace": "1",
                # "gpu_mem_limit": 4 * 1024 * 1024 * 1024,
                # "cudnn_conv1d_pad_to_nc1d": "1",
                "cudnn_conv_algo_search": "HEURISTIC",
            },
        ),
        "CPUExecutionProvider",
    ],
)
def get_data():
    return np.array(np.random.rand(1, 1, 96, 96, 96) * 20, dtype=np.float16)

# compute ONNX Runtime output prediction
start = time.time()
io_binding = ort_session.io_binding()
for i in range(NUMBER):
    io_binding.bind_cpu_input("input", get_data())
    io_binding.bind_output("output")
    ort_session.run_with_iobinding(io_binding)
    ort_out = io_binding.copy_outputs_to_cpu()[0]  # copy the result back to host memory
print(f"ONNX: {time.time() - start}")

Sadly I'm not able to share the model, but I'm using a 3D U-Net model similar to this model.

Urgency

No response

Platform

Linux

OS Version

Ubuntu 18.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.13.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

Yes

SimonRelu avatar Feb 06 '23 13:02 SimonRelu

Update: conv.cc in providers/cuda/nn does have cached_benchmark_results, and it works. I've verified that after the first run, the execution plan is always found in cached_benchmark_results.

SimonRelu avatar Feb 06 '23 15:02 SimonRelu

@SimonRelu, could you please profile the model and see if all nodes are running on GPU?

yufenglee avatar Feb 07 '23 05:02 yufenglee

I've verified that all nodes are running on CUDA.

I've also profiled the model and the execution stack looks like this: [screenshot: Screenshot 2023-02-08 at 14 33 24]

I'm viewing it with chrome://tracing/. Does this look normal or should I use another tool?

SimonRelu avatar Feb 08 '23 13:02 SimonRelu

So I had to set CUDA_LAUNCH_BLOCKING=1, since otherwise the profiler doesn't properly track CUDA execution times. With that I get this: [screenshot: Screenshot 2023-02-09 at 12 30 56] The slow performance thus comes from the Resize op, which was originally aten::upsample_nearest3d in PyTorch. Any idea why this is so much slower in onnxruntime?
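For anyone reproducing this, the variable has to be set before onnxruntime initializes CUDA; a minimal sketch of the profiling setup I used:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before the CUDA context is created

import onnxruntime

opts = onnxruntime.SessionOptions()
opts.enable_profiling = True  # writes a JSON trace viewable in chrome://tracing/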

SimonRelu avatar Feb 09 '23 11:02 SimonRelu

Thanks for sharing the profiling result. What's the input shape and upsample scale? I'll investigate the CUDA kernel. If it needs improvement, we can probably update the kernel soon.

tianleiwu avatar Feb 09 '23 16:02 tianleiwu

Yes, the input is:

- data: Tensor[1, 128, 48, 48, 48]
- scale: [1, 1, 2, 2, 2]
- roi: empty
- mode: "nearest"
- nearest_mode: "floor"
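For reference, this configuration can be reproduced in isolation with a single-node model; a sketch assuming the onnx package, with coordinate_transformation_mode="asymmetric" (which is what the PyTorch exporter uses for nearest upsampling):

import onnx
from onnx import TensorProto, helper

node = helper.make_node(
    "Resize", ["X", "roi", "scales"], ["Y"],
    mode="nearest", nearest_mode="floor",
    coordinate_transformation_mode="asymmetric",
)
graph = helper.make_graph(
    [node], "resize_repro",
    [helper.make_tensor_value_info("X", TensorProto.FLOAT, [1, 128, 48, 48, 48])],
    [helper.make_tensor_value_info("Y", TensorProto.FLOAT, [1, 128, 96, 96, 96])],
    initializer=[
        helper.make_tensor("roi", TensorProto.FLOAT, [0], []),  # empty roi
        helper.make_tensor("scales", TensorProto.FLOAT, [5], [1.0, 1.0, 2.0, 2.0, 2.0]),
    ],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 11)])
onnx.checker.check_model(model)
onnx.save(model, "resize_repro.onnx")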

Thank you for looking into this

SimonRelu avatar Feb 09 '23 17:02 SimonRelu

Hey @tianleiwu, I was wondering if you managed to reproduce the issue and if there is still anything you need from me?

SimonRelu avatar Feb 17 '23 08:02 SimonRelu

Hey, I understand that this might not be the highest priority but I was wondering when you'd be able to look into this. This issue is currently blocking us from releasing with onnxruntime. If it isn't possible to fix this in the near future I'd look into including aten::upsample_nearest3d as a custom op so we can move forward with the release. Thank you for the info :)

SimonRelu avatar Feb 22 '23 18:02 SimonRelu

@SimonRelu, sorry for the late response. I am looking into the issue and will let you know once there is progress.

tianleiwu avatar Feb 22 '23 18:02 tianleiwu

I did some debugging; the kernel uses block=14 and threads=32 for the input, which seemed insufficient. Let me check aten::upsample_nearest3d to see whether we can leverage similar partition logic.

Update: the 14 × 32 partition is for the mapping kernel only. For the resize itself it launches 442,368 blocks of 256 threads, i.e. one thread per output element (1 × 128 × 96 × 96 × 96 = 113,246,208), so the launch configuration is reasonable.

tianleiwu avatar Feb 23 '23 01:02 tianleiwu

That sounds good. Thank you for looking into this :)

SimonRelu avatar Feb 24 '23 09:02 SimonRelu

@zhanghuanrong, please help take a look at the resize cuda kernel since you implemented that part.

tianleiwu avatar Feb 24 '23 22:02 tianleiwu

Hi @SimonRelu, could you please also profile with nsys and show us the kernel name and how long each kernel runs?

zhanghuanrong avatar Feb 24 '23 22:02 zhanghuanrong

Hey @zhanghuanrong, the kernel is called _ResizeNearestKernel and it spends 58 ms there (after several warm-up inference runs) when resizing the Tensor[1, 128, 48, 48, 48].

I also resize tensors of shape Tensor[1, 256, 24, 24, 24] and Tensor[1, 512, 12, 12, 12], which spend 13.5 ms and 3.399 ms in _ResizeNearestKernel respectively (both also with scale [1, 1, 2, 2, 2]).

[screenshot: Screenshot 2023-02-27 at 10 15 07]

Here are the kernel times cropped to one inference run (after warmup): onnx-profile.ods

SimonRelu avatar Feb 27 '23 09:02 SimonRelu

hi @zhanghuanrong , thank you for looking into this. I was just wondering if you could provide an update on this. It'd be great if you could give an estimate on how long this would take since we're currently blocked by this for our release. Thanks a lot!

SimonRelu avatar Mar 02 '23 11:03 SimonRelu

Hi! Any updates on this? I ran into a similar issue when using ONNX for a network with 3D convolutions and resizings coming from pytorch.

a3VonG avatar Mar 08 '23 13:03 a3VonG

Hey, I was wondering if there is any news on this. We'd like to move forward with our release soon, so if it's not possible to fix this I'd like to know, so I can look into an alternative solution. If there is anything you still need from me to better reproduce the issue, please let me know. Thank you for all the info so far.

SimonRelu avatar Mar 14 '23 13:03 SimonRelu

@zhanghuanrong, is there an update on this? ORT is about 6x slower (85.351 ms vs 14.027 ms per call) based on profiling on a T4 GPU:

Profiling of Torch:

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   74.59%  1.44478s       103  14.027ms  8.2769ms  14.859ms  void at::native::_GLOBAL__N__1f4b89ac_20_UpSampleNearest3d_cu_9b21834e::upsample_nearest3d_out_frame<float, __operator_&__(at::native::nearest_neighbor_compute_source_index(float, int, int))>(float const *, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, at::native::nearest_neighbor_compute_source_index*, float, float, float)
                   25.41%  492.06ms       101  4.8719ms  4.6421ms  5.4845ms  [CUDA memcpy HtoD]
      API calls:   86.11%  1.53290s       101  15.177ms  4.6994ms  1.03680s  cudaMemcpyAsync
                   12.66%  225.30ms         4  56.325ms  2.7260us  225.29ms  cudaStreamIsCapturing
                    0.46%  8.1186ms       101  80.382us  77.680us  84.393us  cudaStreamSynchronize

Profiling result of ORT:

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   94.60%  8.79115s       103  85.351ms  84.858ms  96.348ms  void onnxruntime::cuda::_ResizeNearestKernel<float>(int, onnxruntime::cuda::TArray<long, int=8>, onnxruntime::cuda::_ResizeNearestKernel<float, onnxruntime::cuda::fast_divmod, int=8>, float const *, onnxruntime::cuda::_ResizeNearestKernel<float, onnxruntime::cuda::fast_divmod, int=8>*, unsigned long, onnxruntime::cuda::_ResizeNearestKernel<float, onnxruntime::cuda::fast_divmod, int=8>, long const *, onnxruntime::cuda::NearestMappingInfo const *)
                    5.30%  492.96ms       102  4.8329ms  4.6511ms  5.2684ms  [CUDA memcpy HtoD]
                    0.00%  388.48us         1  388.48us  388.48us  388.48us  [CUDA memcpy DtoD]
                    0.00%  349.34us       103  3.3910us  3.2320us  7.2640us  void onnxruntime::cuda::_ResizeNearestMappingKernel<float, onnxruntime::cuda::TransformCoordinate_ASYMMETRIC, onnxruntime::cuda::NearestPixel_FLOOR>(unsigned long, onnxruntime::cuda::TArray<long, int=8>, onnxruntime::cuda::TArray, onnxruntime::cuda::_ResizeNearestMappingKernel<float, onnxruntime::cuda::TransformCoordinate_ASYMMETRIC<float, int=8>, onnxruntime::cuda::NearestPixel_FLOOR>, onnxruntime::cuda::_ResizeNearestMappingKernel<float, onnxruntime::cuda::TransformCoordinate_ASYMMETRIC<float, int=10>, onnxruntime::cuda::NearestPixel_FLOOR>, unsigned long, bool, onnxruntime::cuda::TransformCoordinate_ASYMMETRIC const &, onnxruntime::cuda::NearestPixel_FLOOR const &, long*, onnxruntime::cuda::NearestMappingInfo*)

Example script for profiling:

import torch
import time

def test_torch():
    input = torch.rand(1, 128, 48, 48, 48, dtype=torch.float32, requires_grad=False).to('cuda')
    model = torch.nn.Upsample(scale_factor=2, mode='nearest').cuda()

    torch.set_grad_enabled(False)

    for _ in range(3): # warm up
        model(input)
    
    latency = []
    for _ in range(100):
        input = torch.rand(1, 128, 48, 48, 48, dtype=torch.float32, requires_grad=False).to('cuda') # input is in GPU
        torch.cuda.synchronize()
        start = time.time()
        output = model(input)
        torch.cuda.synchronize()
        end = time.time()
        latency.append(end-start)
    average_latency_in_ms = 1000 * sum(latency) / len(latency)
    print(f"Torch latency: {average_latency_in_ms:.2f} ms")

def export_onnx():
    input = torch.rand(1, 128, 48, 48, 48, dtype=torch.float32, requires_grad=False).to('cuda')
    model = torch.nn.Upsample(scale_factor=2, mode='nearest').cuda()
    torch.onnx.export(model, input, "upsample.onnx", input_names=["input"], output_names=["output"], opset_version=11)

def test_onnx():
    import onnxruntime
    import numpy as np
    options = onnxruntime.SessionOptions()
    session = onnxruntime.InferenceSession("upsample.onnx", options, providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

    input = torch.rand(1, 128, 48, 48, 48, dtype=torch.float32, requires_grad=False).to('cuda')
    input_tensor = input.contiguous()
    output_tensor = torch.empty((1, 128, 96, 96, 96), dtype=torch.float32, device='cuda:0').contiguous()

    binding = session.io_binding()
    binding.bind_input(
        name='input',
        device_type='cuda',
        device_id=0,
        element_type=np.float32,
        shape=tuple(input_tensor.shape),
        buffer_ptr=input_tensor.data_ptr(),
        )
    binding.bind_output(
        name='output',
        device_type='cuda',
        device_id=0,
        element_type=np.float32,
        shape=tuple(output_tensor.shape),
        buffer_ptr=output_tensor.data_ptr(),
    )

    for _ in range(3): # warm up
        session.run_with_iobinding(binding)

    ort_latency = []
    for _ in range(100):
        input = torch.rand(1, 128, 48, 48, 48, dtype=torch.float32, requires_grad=False).to('cuda') # input is in GPU
        input_tensor = input.contiguous()
        start = time.time()
        binding.bind_input(
                name='input',
                device_type='cuda',
                device_id=0,
                element_type=np.float32,
                shape=tuple(input_tensor.shape),
                buffer_ptr=input_tensor.data_ptr(),
                )
        session.run_with_iobinding(binding)
        end = time.time()
        ort_latency.append(end - start)

    average_latency_in_ms = 1000 * sum(ort_latency) / len(ort_latency)
    print(f"Ort Latency: {average_latency_in_ms:.2f} ms")
    
#test_torch()
#export_onnx()
test_onnx()

I think one solution is to use the aten kernel for 3d upsample. @yufenglee to track this work.

tianleiwu avatar Mar 15 '23 07:03 tianleiwu

I have converted a BERT-variant model; with ORT on a GeForce RTX 3090, the ONNX model runs faster than the PyTorch model. Below are the results for the 3090. But when I used a Tesla T4 GPU, the ONNX model was slower than the PyTorch model.

| Text Length | PyTorch Time | ONNX Time | Percentage Improvement |
|-------------|--------------|-----------|------------------------|
| 79          | 0.714        | 0.230     | 67.86%                 |
| 177         | 0.183        | 0.098     | 46.43%                 |
| 78          | 0.293        | 0.060     | 79.45%                 |

This table presents the processing times for both PyTorch and ONNX models along with the percentage improvement in processing time when using the ONNX model over PyTorch for each respective text length.

Could you please explain the main reason the ONNX model is slow on the T4 GPU?

This problem was solved with CUDA 12.1 and PyTorch 2.2.1.

saiful9379 avatar May 16 '24 19:05 saiful9379

I've updated the title since, indeed, overall onnxruntime seems to be faster, but the Resize operator for 3D tensors is significantly slower than the one in PyTorch.

The reason for this is that in PyTorch there is a dedicated kernel for 3D (as well as 2D) tensors. In onnxruntime, only a dedicated kernel for 2D tensors is present. Any other rank is handled by a more generic kernel, which is quite a bit slower since a lot less is known at compile time, leading to fewer optimisations: https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/cuda/tensor/upsample_impl.cu
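For reference, the computation itself is trivial; a NumPy sketch of nearest-neighbor 3D upsampling with asymmetric coordinates and floor rounding (integer scale assumed), which is all the dedicated PyTorch kernel has to do:

import numpy as np

def upsample_nearest3d(x, scale=2):
    # asymmetric coords + floor rounding: output index i reads input index i // scale
    n, c, d, h, w = x.shape
    di = np.arange(d * scale) // scale
    hi = np.arange(h * scale) // scale
    wi = np.arange(w * scale) // scale
    return x[:, :, di[:, None, None], hi[None, :, None], wi[None, None, :]]

x = np.random.rand(1, 128, 48, 48, 48).astype(np.float32)
y = upsample_nearest3d(x)  # shape (1, 128, 96, 96, 96)

A dedicated kernel can bake the rank and this simple index math in at compile time, whereas the generic kernel has to do per-axis divmods against runtime shape data for every output element.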

SimonRelu avatar May 16 '24 22:05 SimonRelu

@SimonRelu, we've added it to the backlog, targeted for the 1.19 release. If you have made improvements to the kernel, feel free to submit a pull request to contribute.

tianleiwu avatar May 20 '24 16:05 tianleiwu

Thank you for adding it to the milestone. I ported the PyTorch implementation as a custom operator. It is, however, quite hacky and only works for our use case, so I'm not sure how useful it is.

The pytorch implementation can be found here: https://github.com/pytorch/pytorch/blob/acfe237a71af609e837a34bb38048aa8acb8eb4d/aten/src/ATen/native/cuda/UpSampleNearest3d.cu#L4
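In case it helps anyone, the general mechanism for plugging such a kernel in is SessionOptions.register_custom_ops_library; the library name below is made up:

import onnxruntime

opts = onnxruntime.SessionOptions()
# shared library that registers the custom upsample op (hypothetical path)
opts.register_custom_ops_library("./libupsample_nearest3d.so")
sess = onnxruntime.InferenceSession(
    "model.onnx", opts,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)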

SimonRelu avatar May 20 '24 21:05 SimonRelu

Hey, I was wondering if there is an update on the status of this issue. Was this resolved in 1.19? We might want to upgrade our onnxruntime version, and it would be good to know whether this was fixed.

SimonRelu avatar Mar 26 '25 10:03 SimonRelu

@SimonRelu, it is not implemented yet. It is still in the backlog due to resource limits; I'll update here if there is progress.

tianleiwu avatar Mar 26 '25 17:03 tianleiwu