[Performance] cudaMemcpyAsync dominates runtime for batch inference with FP32 inputs
Describe the issue
Observed behavior
When running batch inference (batch = 6, input size 6*3*1280*1280) in FP32 on the GPU (EP: CUDA Provider) with ONNX Runtime, the majority of the time inside session.Run() is spent in cudaMemcpyAsync. Even though GPU utilization increases, the total inference time scales almost linearly with batch size.
CUDA & cuDNN: 11.8 with 8.9.5
ONNX model: dynamic-batch YOLOv9 (FP32, input shape [batch, 3, 1280, 1280])
Note: I initially suspected a hardware/CUDA compatibility issue, but the same behavior occurs on an RTX 5080 with CUDA 12.5 and cuDNN 8.9.7.
Request / Question
Can ONNX Runtime improve Run() performance for batches like this, where input memory copies dominate the total time? Is this behavior expected for dynamic-batch models, or should this batch inference be faster?
Thanks for your time and support. I really appreciate any guidance or suggestions to improve ONNX Runtime performance for small-batch GPU inference.
To reproduce
CreateBatchTensor
private static Tensor<float> CreateBatchTensor(
    List<Mat> mats, int[] dims, string format, bool normalize)
{
    if (dims.Length != 4)
        throw new Exception("dims mismatch.");

    int batch = dims[0];
    int channels = dims[1];
    int targetHeight = dims[2];
    int targetWidth = dims[3];

    if (batch != mats.Count)
        throw new Exception($"batch mismatch! dims[0]={batch}, but mats.Count={mats.Count}");

    int targetTensorSize = batch * channels * targetHeight * targetWidth;
    var allValues = new float[targetTensorSize];

    for (int b = 0; b < batch; b++)
    {
        Mat mat = mats[b];
        if (mat.Empty())
            throw new Exception($"input image {b} is empty.");

        // Ensure float32
        Mat matRef = new Mat();
        if (mat.Type() != MatType.CV_32FC(mat.Channels()))
            mat.ConvertTo(matRef, MatType.CV_32FC(mat.Channels()));
        else
            matRef = mat;

        // BGR -> RGB
        Mat rgbMat = new Mat();
        Cv2.CvtColor(matRef, rgbMat, ColorConversionCodes.BGR2RGB);

        // Normalize
        Mat targetMat = new Mat();
        rgbMat.ConvertTo(targetMat, MatType.CV_32FC3, normalize ? 1.0 / 255.0 : 1.0);

        // Resize
        Mat resizeMat = new Mat();
        Cv2.Resize(targetMat, resizeMat, new Size(targetWidth, targetHeight));

        if (format == "CHW")
        {
            // Split channels and copy each plane into the batch buffer
            Mat[] matChannels = Cv2.Split(resizeMat);
            for (int c = 0; c < channels; c++)
            {
                float[] channelData = new float[targetHeight * targetWidth];
                Marshal.Copy(matChannels[c].Data, channelData, 0, channelData.Length);

                int offset = b * channels * targetHeight * targetWidth + c * targetHeight * targetWidth;
                Array.Copy(channelData, 0, allValues, offset, channelData.Length);
                matChannels[c].Dispose();
            }
        }
        else
        {
            // HWC: copy the interleaved data directly
            float[] rawData = new float[targetHeight * targetWidth * channels];
            Marshal.Copy(resizeMat.Data, rawData, 0, rawData.Length);

            int offset = b * channels * targetHeight * targetWidth;
            Array.Copy(rawData, 0, allValues, offset, rawData.Length);
        }
    }

    return new DenseTensor<float>(allValues, dims);
}
Run
// Inference
IDisposableReadOnlyCollection<DisposableNamedOnnxValue> outputTensors;
var sw = Stopwatch.StartNew();
try
{
    outputTensors = session_.Run(
        new[] { NamedOnnxValue.CreateFromTensor<float>(inputNames[0], inputTensor) },
        outputNames
    );
}
catch (OnnxRuntimeException e)
{
    LogError(func, e.Message);
    return false;
}
sw.Stop();
LogInfo(func, "Inference time consumed: " + sw.Elapsed.TotalSeconds + " s.");
Session options
// Configure CUDA execution provider options
var cudaOptions = new OrtCUDAProviderOptions();
cudaOptions.UpdateOptions(new Dictionary<string, string>
{
    { "device_id", deviceId.ToString() },            // GPU device ID
    { "arena_extend_strategy", "kNextPowerOfTwo" },  // Memory allocation growth strategy
    { "cudnn_conv_algo_search", "HEURISTIC" },       // cuDNN algorithm search strategy
    { "do_copy_in_default_stream", "1" }             // Perform memory copies in the default CUDA stream
});

// Append CUDA execution provider to session options
sessionOps_.AppendExecutionProvider_CUDA(cudaOptions);
Model pre-warm: run once before timing to exclude first-run overhead
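A minimal sketch of this warm-up, reusing the session_, inputNames, and outputNames from the snippets above (the dummy input shape is illustrative):

// Warm-up: run the session once so CUDA context creation, cuDNN algorithm
// selection, and memory-arena growth are paid before the timed run.
// The shape matches the batch used above; any representative input works.
var warmupTensor = new DenseTensor<float>(new[] { 6, 3, 1280, 1280 });
using (var warmupOutputs = session_.Run(
    new[] { NamedOnnxValue.CreateFromTensor<float>(inputNames[0], warmupTensor) },
    outputNames))
{
    // Results are discarded; this run only absorbs one-time initialization cost.
}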
Urgency
No response
Platform
Windows
OS Version
Windows 10
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.18.0
ONNX Runtime API
C#
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA & cuDNN: 11.8 with 8.9.5 (NVIDIA GTX 1660 SUPER)
Model File
No response
Is this a quantized model?
No
Did you try to save the optimized model to disk? If most of the time is spent in cudaMemcpy, I would expect to see additional nodes moving data between CPU and GPU.
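For example, something along these lines on the sessionOps_ options object from the snippet above (the output path is illustrative):

// Ask ONNX Runtime to dump the graph it actually executes, so it can be
// inspected (e.g. in Netron) for inserted MemcpyToHost / MemcpyFromHost nodes
// that would indicate data moving between CPU and GPU inside Run().
sessionOps_.GraphOptimizationLevel = GraphOptimizationLevel.ORT_ENABLE_ALL;
sessionOps_.OptimizedModelFilePath = "yolov9_optimized.onnx";  // illustrative path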
I already tried running inference with the optimized YOLOv9 model (input shape [batch, 3, 1280, 1280]), but unfortunately it didn't have any effect. Does ONNX Runtime support printing the execution time of individual nodes? That way I could identify which part is causing the performance bottleneck.
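For reference, per-node execution times can be captured with ONNX Runtime's built-in profiler; a minimal sketch (the model path and output prefix below are illustrative):

// Enable profiling on the SessionOptions before creating the session;
// every Run() is then traced at node granularity.
var profilingOptions = new SessionOptions();
profilingOptions.EnableProfiling = true;
profilingOptions.ProfileOutputPathPrefix = "ort_profile";   // illustrative prefix
profilingOptions.AppendExecutionProvider_CUDA(0);

using var profiledSession = new InferenceSession("yolov9.onnx", profilingOptions);
// ... run inference as usual ...

// EndProfiling() writes a Chrome-trace JSON file (viewable in chrome://tracing
// or Perfetto) listing the execution time of each node, including any
// MemcpyToHost / MemcpyFromHost nodes.
string profileFile = profiledSession.EndProfiling();
Console.WriteLine("Profile written to: " + profileFile);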
I've experienced exactly the same thing. If you bind your input and output tensors with Ort::IoBinding, the cudaMemcpyAsync block changes to cudaStreamSynchronize. This means the long-running block is actually your real inference time (cudaMemcpyAsync behaves asynchronously, but it has to wait for the inference output). CUDA launches kernel functions asynchronously; it does not wait until a kernel is finished. After all kernels have been inserted into the CUDA stream (which takes microseconds), the program context waits until Ort::Session's Run() returns its value, and that wait is the inference time. If you want to decrease the time, you have to optimize the CUDA kernels themselves, not the pipeline. There are many ways to optimize CNN computation, such as GEMM, im2col, Winograd, etc.
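A rough C# equivalent of that I/O binding approach with Microsoft.ML.OnnxRuntime might look like the sketch below (assumptions: device 0, and the session_, inputTensor, inputNames, and outputNames from the snippets above):

// Bind the CPU input once and keep the output on the GPU, so the time spent in
// RunWithBinding() reflects kernel execution rather than the output copy.
using var binding = session_.CreateIoBinding();

// Input: bind the existing CPU-side tensor.
using var inputValue = FixedBufferOnnxValue.CreateFromTensor(inputTensor);
binding.BindInput(inputNames[0], inputValue);

// Output: let ONNX Runtime allocate it in CUDA device memory (device 0 here).
var cudaMemInfo = new OrtMemoryInfo(OrtMemoryInfo.allocatorCUDA,
                                    OrtAllocatorType.DeviceAllocator,
                                    0, OrtMemType.Default);
binding.BindOutputToDevice(outputNames[0], cudaMemInfo);

using var runOptions = new RunOptions();
session_.RunWithBinding(runOptions, binding);

// The bound outputs stay on the device; copy them back to the host only when
// the values are actually needed.
using var outputs = binding.GetOutputValues();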