
[Performance] cuda graphs optimization refuses to apply to a cuda provider model

Open pommedeterresautee opened this issue 3 years ago • 3 comments

Describe the issue

I want to apply the CUDA Graphs optimization to a transformer model running on the CUDA provider. The fallback provider is disabled, so (in my understanding) only the CUDA provider should be used.

Unfortunately, I got this issue:

test/test_torchdynamo_bert.py:42: in <module>
    _ = get_bert_onnx()
test/models/bert.py:25: in get_bert_onnx
    return get_model_onnx(model_name, models_dir)
utils/modeling_utils.py:34: in get_model_onnx
    model_onnx = build_onnx(model_name, model_path)
utils/onnx_utils.py:27: in build_onnx
    onnx_model = create_model_for_provider(onnx_path.as_posix())
utils/ort_utils.py:30: in create_model_for_provider
    session = InferenceSession(path, options, providers=[("CUDAExecutionProvider", {'do_copy_in_default_stream': False,
/home/geantvert/workspace/triton-xp/venv/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py:347: in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
/home/geantvert/workspace/triton-xp/venv/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py:395: in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
E   onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : This session cannot use the CUDA Graph feature as requested by the user  as all the graph nodes have not been partitioned to the CUDA EP.

I don't understand how this exception can be raised, as I declare only one provider and forbid the CPU one.

The model is BERT from Hugging Face, converted to ONNX using HF tooling (no fancy stuff).

To reproduce

The code to launch a session looks like this:

from onnxruntime import GraphOptimizationLevel, InferenceSession, SessionOptions


def create_model_for_provider(
        path: str,
) -> InferenceSession:
    print("Loading model from", path)
    options = SessionOptions()
    options.graph_optimization_level = GraphOptimizationLevel.ORT_ENABLE_EXTENDED
    session = InferenceSession(
        path,
        options,
        providers=[("CUDAExecutionProvider", {"enable_cuda_graph": True})],
    )
    session.disable_fallback()
    return session

Urgency

No response

Platform

Linux

OS Version

Ubuntu 22.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.12.1

ONNX Runtime API

Python

Architecture

X86

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.4

Model File

https://drive.google.com/file/d/1ppNZwLA1WxGwaL7OUhnNaC54OVmVkWiv/view?usp=sharing

Is this a quantized model?

No

pommedeterresautee avatar Sep 15 '22 09:09 pommedeterresautee

In ORT, some ops like "Shape" will be placed on CPU even if the model is running with the CUDA EP. Since it's a standard Hugging Face BERT model, you may want to try the transformers optimizer benchmark (https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/README.md#benchmark) to get an optimized BERT model where the graph is simplified and no op should be placed on CPU at runtime.
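For reference, a minimal sketch of invoking that offline optimizer on the exported model; the file paths are placeholders, and num_heads/hidden_size assume a bert-base checkpoint (adjust for others):

```shell
# Fuse attention/LayerNorm subgraphs and fold the Shape chains away offline.
# "model.onnx" / "model_opt.onnx" are placeholder paths.
python -m onnxruntime.transformers.optimizer \
    --input model.onnx \
    --output model_opt.onnx \
    --model_type bert \
    --num_heads 12 \
    --hidden_size 768
```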

wangyems avatar Sep 15 '22 17:09 wangyems

The error message is related to {'enable_cuda_graph': True}. It is an advanced feature and it cannot be applied to every model due to its limitations: https://natke.github.io/onnxruntime/docs/performance/tune-performance.html#using-cuda-graphs-in-the-cuda-ep

Try removing it, or update the line as follows: session = InferenceSession(path, options, providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

tianleiwu avatar Sep 21 '22 21:09 tianleiwu

I understand that it's because of CUDA Graphs, and inference works without enabling it. I wanted to test with CUDA Graphs; I even tried the optimized model but had no luck, I always get this issue. What I don't understand is that I forbid the fallback on CPU…

pommedeterresautee avatar Sep 22 '22 05:09 pommedeterresautee

Hi @pommedeterresautee,

Even though you choose the CUDA EP, the core runtime force-places some shape-massaging nodes onto CPU because it is counter-productive to hardware-accelerate these ops, and when the CUDA EP sees that some nodes have been placed on CPU, it flags this and the load fails. Even if we did force these shape-massaging nodes onto CUDA, it would be of no use, as that would introduce copy nodes to move the shape information from host to device (the shape of a tensor is stored on CPU), and this again means we cannot use CUDA Graphs, as the copy operation cannot be stream-captured.
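One way to observe this partitioning (a sketch, not from the thread; the model path and provider list are assumptions) is to load the model with verbose logging, which should make ORT print each node's assigned execution provider during session initialization:

```python
import onnxruntime as ort

# log_severity_level = 0 (VERBOSE) makes ORT log node placements while
# the session initializes, including which nodes were assigned to
# CPUExecutionProvider. "model.onnx" is a placeholder path.
so = ort.SessionOptions()
so.log_severity_level = 0  # VERBOSE

sess = ort.InferenceSession(
    "model.onnx",
    so,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
```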

As you can see, CUDA Graphs only works for very specific kinds of models, and because of this the coverage of models for CUDA Graphs is quite low at this point.

CC: @feihugis

hariharans29 avatar Sep 26 '22 17:09 hariharans29

Thank you @hariharans29 for your answer, it's very clear!

pommedeterresautee avatar Sep 27 '22 21:09 pommedeterresautee