
[Performance] cudaMemcpyAsync dominates runtime for batch inference with FP32 inputs

Open · GilbertPan97 opened this issue 4 months ago · 3 comments

Describe the issue

When running batch inference (batch = 6, input tensor [6, 3, 1280, 1280], FP32) on the GPU with the CUDA execution provider, the majority of the time inside session.Run() is spent in cudaMemcpyAsync. Even though GPU utilization increases, the total inference time scales almost linearly with the batch size.

CUDA & cuDNN: 11.8 with 8.9.5
ONNX model: dynamic-batch YOLOv9 (FP32, input shape [batch, 3, 1280, 1280])

Note: I initially suspected a hardware/CUDA compatibility issue, but the same behavior occurs on an RTX 5080 with CUDA 12.5 and cuDNN 8.9.7.

Observed behavior

[Profiling screenshots attached to the original issue]

Request / Question

Can ONNX Runtime improve Run() performance for batches like this, where input memory copies dominate the total time? Is this behavior expected for dynamic-batch models, or should batch inference be faster?

Thanks for your time and support. I really appreciate any guidance or suggestions to improve ONNX Runtime performance for small-batch GPU inference.

To reproduce

CreateBatchTensor

private static Tensor<float> CreateBatchTensor(
    List<Mat> mats, int[] dims, string format, bool normalize)
{
    if (dims.Length != 4)
        throw new Exception("dims mismatch.");

    int batch = dims[0];
    int channels = dims[1];
    int targetHeight = dims[2];
    int targetWidth = dims[3];

    if (batch != mats.Count)
        throw new Exception($"batch mismatch! dims[0]={batch}, but mats.Count={mats.Count}");

    int targetTensorSize = batch * channels * targetHeight * targetWidth;
    var allValues = new float[targetTensorSize];

    for (int b = 0; b < batch; b++)
    {
        Mat mat = mats[b];
        if (mat.Empty())
            throw new Exception($"input image {b} is empty.");

        // Ensure float32
        Mat matRef = new Mat();
        if (mat.Type() != MatType.CV_32FC(mat.Channels()))
            mat.ConvertTo(matRef, MatType.CV_32FC(mat.Channels()));
        else
            matRef = mat;

        // BGR -> RGB
        Mat rgbMat = new Mat();
        Cv2.CvtColor(matRef, rgbMat, ColorConversionCodes.BGR2RGB);

        // Normalize
        Mat targetMat = new Mat();
        rgbMat.ConvertTo(targetMat, MatType.CV_32FC3, normalize ? 1.0 / 255.0 : 1.0);

        // Resize
        Mat resizeMat = new Mat();
        Cv2.Resize(targetMat, resizeMat, new Size(targetWidth, targetHeight));

        if (format == "CHW")
        {
            // Split channels
            Mat[] matChannels = Cv2.Split(resizeMat);
            for (int c = 0; c < channels; c++)
            {
                float[] channelData = new float[targetHeight * targetWidth];
                Marshal.Copy(matChannels[c].Data, channelData, 0, channelData.Length);

                // Copy into batch buffer
                int offset = b * channels * targetHeight * targetWidth + c * targetHeight * targetWidth;
                Array.Copy(channelData, 0, allValues, offset, channelData.Length);

                matChannels[c].Dispose();
            }
        }
        else
        {
            // HWC
            float[] rawData = new float[targetHeight * targetWidth * channels];
            Marshal.Copy(resizeMat.Data, rawData, 0, rawData.Length);

            int offset = b * channels * targetHeight * targetWidth;
            Array.Copy(rawData, 0, allValues, offset, rawData.Length);
        }
    }

    return new DenseTensor<float>(allValues, dims);
}
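
A side note on the preprocessing itself: the CHW branch copies every channel twice (native Mat -> channelData -> allValues). A slightly leaner variant, assuming the single-channel Mats returned by Cv2.Split are contiguous CV_32FC1 buffers, copies straight into the batch array:

// Hypothetical variant of the CHW branch: copy each channel plane directly
// from native memory into the batch buffer, skipping the channelData array.
Mat[] matChannels = Cv2.Split(resizeMat);
for (int c = 0; c < channels; c++)
{
    int offset = b * channels * targetHeight * targetWidth
               + c * targetHeight * targetWidth;
    // Assumes matChannels[c] is a contiguous CV_32FC1 Mat (Cv2.Split allocates
    // fresh, continuous Mats, so this should hold for this pipeline).
    Marshal.Copy(matChannels[c].Data, allValues, offset, targetHeight * targetWidth);
    matChannels[c].Dispose();
}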

Run

// Inference
IDisposableReadOnlyCollection<DisposableNamedOnnxValue> outputTensors;
var sw = Stopwatch.StartNew();
try
{
    outputTensors = session_.Run(
        new[] { NamedOnnxValue.CreateFromTensor<float>(inputNames[0], inputTensor) },
        outputNames
    );
}
catch (OnnxRuntimeException e)
{
    LogError(func, e.Message);
    return false;
}
sw.Stop();
LogInfo(func, "Inference time: " + sw.Elapsed.TotalSeconds + " s.");
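
For completeness, the returned collection is disposable; reading the result back and releasing the native OrtValues could be sketched like this (the dimensions log line is purely illustrative):

// Sketch: consume and dispose the outputs so the native memory is freed.
using (outputTensors)
{
    foreach (var value in outputTensors)
    {
        var tensor = value.AsTensor<float>();
        LogInfo(func, $"Output '{value.Name}' dims: [{string.Join(",", tensor.Dimensions.ToArray())}]");
    }
}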

Session options

// Configure CUDA execution provider options
var cudaOptions = new OrtCUDAProviderOptions();
cudaOptions.UpdateOptions(new Dictionary<string, string>
{
    { "device_id", deviceId.ToString() },             // GPU device ID
    { "arena_extend_strategy", "kNextPowerOfTwo" },   // Memory allocation growth strategy
    { "cudnn_conv_algo_search", "HEURISTIC" },        // cuDNN algorithm search strategy
    { "do_copy_in_default_stream", "1" }              // Perform memory copy in the default CUDA stream
});

// Append CUDA execution provider to session options
sessionOps_.AppendExecutionProvider_CUDA(cudaOptions);

Model pre-warm: the session is run once before timing to exclude first-run overhead (see the sketch below).
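
A minimal warm-up sketch under the same assumptions as the snippets above (session_, inputNames, outputNames); the zero-filled tensor only exists to trigger lazy initialization:

// Warm-up: one throwaway Run with a zero tensor of the production shape so
// cuDNN algorithm search, arena growth, etc. are excluded from later timings.
var warmupTensor = new DenseTensor<float>(new[] { 6, 3, 1280, 1280 });
using (session_.Run(
    new[] { NamedOnnxValue.CreateFromTensor(inputNames[0], warmupTensor) },
    outputNames))
{
    // Results are discarded; only the side effects of the first Run matter.
}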

Urgency

No response

Platform

Windows

OS Version

Windows 10

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.18.0

ONNX Runtime API

C#

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA & cuDNN: 11.8 with 8.9.5 (NVIDIA GeForce GTX 1660 SUPER)

Model File

No response

Is this a quantized model?

No

GilbertPan97 · Aug 26 '25 05:08

Did you try to save the optimized model to disk? If most of the time is spent in cudaMemcpy, I would expect to see additional nodes moving data between CPU and GPU.

xadupre · Sep 01 '25 12:09
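
For reference, saving the optimized graph from the C# API can be sketched as follows (property names from Microsoft.ML.OnnxRuntime; the file paths are placeholders):

// Sketch: dump the graph after ORT's optimizations so any inserted
// MemcpyToHost / MemcpyFromHost nodes become visible in Netron or similar.
var so = new SessionOptions();
so.GraphOptimizationLevel = GraphOptimizationLevel.ORT_ENABLE_ALL;
so.OptimizedModelFilePath = "yolov9_optimized.onnx";            // placeholder path
so.AppendExecutionProvider_CUDA(deviceId);
using var session = new InferenceSession("yolov9.onnx", so);    // placeholder path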

> Did you try to save the optimized model to disk? If most of the time is spent in cudaMemcpy, I would expect to see additional nodes moving data between CPU and GPU.

I already tried running inference with the optimized YOLO-v9 model (input shape [batch, 3, 1280, 1280]), but unfortunately it didn’t have any effect. Does ONNX Runtime support printing the execution time of individual nodes? That way I could identify which part is causing the performance bottleneck.

GilbertPan97 · Sep 05 '25 01:09
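
On the per-node timing question: ONNX Runtime ships a session profiler that writes a Chrome-trace JSON with per-node durations; a minimal C# sketch (the prefix and model path are placeholders):

// Sketch: enable profiling; EndProfiling() returns the path of a JSON trace
// that lists the execution time of each node.
var so = new SessionOptions();
so.EnableProfiling = true;
so.ProfileOutputPathPrefix = "ort_profile";                   // placeholder prefix
so.AppendExecutionProvider_CUDA(deviceId);
using var session = new InferenceSession("yolov9.onnx", so);  // placeholder path

// ... run inference as usual ...

string profileFile = session.EndProfiling();
Console.WriteLine($"Profile written to {profileFile}");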

I've experienced exactly the same thing. If you bind your input and output tensors with Ort::IoBinding, the cudaMemcpyAsync block changes into cudaStreamSynchronize. In other words, that long block is your real inference time: cudaMemcpyAsync is asynchronous, but it still has to wait for the inference output. CUDA launches kernels asynchronously and does not wait for them to finish; after all kernels have been enqueued on the CUDA stream (which takes microseconds), the calling context waits for Ort::Session::Run() to return its value, and that wait includes the inference time. If you want to reduce that time, you have to optimize the CUDA kernels themselves, not the pipeline. There are many ways to optimize CNN computation, such as GEMM, im2col, Winograd, etc.

sejunkwonme · Dec 08 '25 14:12
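
For completeness, the IoBinding approach described above might look roughly like this in the C# API (a sketch only; class and method names are from Microsoft.ML.OnnxRuntime, and binding a single output to the device is an assumption about the model):

// Sketch: bind the CPU input once, keep the output on the CUDA device, and
// run with the binding; the copy then no longer hides the kernel time.
using var binding = session_.CreateIoBinding();

using var inputValue = FixedBufferOnnxValue.CreateFromTensor(inputTensor);
binding.BindInput(inputNames[0], inputValue);

// Ask ORT to leave the output in GPU memory instead of copying it back.
using var cudaMemInfo = new OrtMemoryInfo(
    OrtMemoryInfo.allocatorCUDA, OrtAllocatorType.DeviceAllocator,
    deviceId, OrtMemType.Default);
binding.BindOutputToDevice(outputNames[0], cudaMemInfo);

using var runOptions = new RunOptions();
session_.RunWithBinding(runOptions, binding);

// The OrtValues returned here still live on the GPU; copy them to the CPU
// only when the results are actually needed.
using var outputs = binding.GetOutputValues();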