onnxruntime
[Performance] cuda graphs optimization refuses to apply to a cuda provider model
Describe the issue
I want to apply the CUDA Graphs optimization to a transformer model running on the CUDA provider. Fallback is disabled, so (in my understanding) only the CUDA provider should be used.
Unfortunately, I get this error:
test/test_torchdynamo_bert.py:42: in <module>
_ = get_bert_onnx()
test/models/bert.py:25: in get_bert_onnx
return get_model_onnx(model_name, models_dir)
utils/modeling_utils.py:34: in get_model_onnx
model_onnx = build_onnx(model_name, model_path)
utils/onnx_utils.py:27: in build_onnx
onnx_model = create_model_for_provider(onnx_path.as_posix())
utils/ort_utils.py:30: in create_model_for_provider
session = InferenceSession(path, options, providers=[("CUDAExecutionProvider", {'do_copy_in_default_stream': False,
/home/geantvert/workspace/triton-xp/venv/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py:347: in __init__
self._create_inference_session(providers, provider_options, disabled_optimizers)
/home/geantvert/workspace/triton-xp/venv/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py:395: in _create_inference_session
sess.initialize_session(providers, provider_options, disabled_optimizers)
E onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : This session cannot use the CUDA Graph feature as requested by the user as all the graph nodes have not been partitioned to the CUDA EP.
I don't understand how this exception can be raised, as I declare only one provider and forbid the CPU one.
The model is BERT from HF, converted to ONNX using HF tooling (no fancy stuff).
To reproduce
The code that launches a session looks like this:
from onnxruntime import GraphOptimizationLevel, InferenceSession, SessionOptions

def create_model_for_provider(
    path: str,
) -> InferenceSession:
    print("Loading model from", path)
    options = SessionOptions()
    options.graph_optimization_level = GraphOptimizationLevel.ORT_ENABLE_EXTENDED
    session = InferenceSession(
        path,
        options,
        providers=[("CUDAExecutionProvider", {"enable_cuda_graph": True})],
    )
    session.disable_fallback()
    return session
Urgency
No response
Platform
Linux
OS Version
Ubuntu 22.04
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.12.1
ONNX Runtime API
Python
Architecture
X86
Execution Provider
CUDA
Execution Provider Library Version
CUDA 11.4
Model File
https://drive.google.com/file/d/1ppNZwLA1WxGwaL7OUhnNaC54OVmVkWiv/view?usp=sharing
Is this a quantized model?
No
In ORT, some ops like "Shape" are placed on CPU even when the model runs with the CUDA EP. Since it's a standard Hugging Face BERT model, you may want to try the transformers optimizer benchmark (https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/README.md#benchmark) to get an optimized BERT model where the graph is simplified and no op should be placed on CPU at runtime.
The error message is related to {'enable_cuda_graph': True}. It is an advanced feature and cannot be applied to every model, due to limitations: https://natke.github.io/onnxruntime/docs/performance/tune-performance.html#using-cuda-graphs-in-the-cuda-ep
Try removing it, or update the line as follows:
session = InferenceSession(path, options, providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
I understand that it's because of CUDA Graphs, and inference works without enabling it. I wanted to test with CUDA Graphs; I even tried an optimized model but had no luck, always hitting this issue. What I don't understand is that I forbid the fallback on CPU…
Hi @pommedeterresautee,
Even though you choose the CUDA EP, the core runtime force-places some shape-massaging nodes onto CPU, because it is counter-productive to hardware-accelerate these ops. When the CUDA EP sees that some nodes have been placed on CPU, it flags this and the load fails. Even if we did force these shape-massaging nodes onto CUDA, it would be of no use: that would introduce copy nodes to move the shape information from host to device (the shape of a tensor is stored on CPU), and this again means we cannot use CUDA Graphs, as the copy operation cannot be stream-captured.
As you can see, CUDA Graphs only works for very specific kinds of models, and at this point model coverage for CUDA Graphs is quite low because of this.
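The check ORT performs at session creation can be sketched in plain Python. This is only a conceptual illustration: the op set and the `cuda_graph_viable` helper below are assumptions for the sketch, not ORT's actual CPU-pinning list or API.

```python
# Conceptual sketch of the partitioning check described above: CUDA Graph
# capture is refused unless every node is assigned to the CUDA EP.
# The set below is an illustrative guess at shape-massaging ops that the
# runtime pins to CPU; it is not ORT's real list.
CPU_PINNED_OPS = {"Shape", "Size", "ConstantOfShape"}

def cuda_graph_viable(op_types):
    """Return True only if no node would be force-placed on the CPU EP."""
    return all(op not in CPU_PINNED_OPS for op in op_types)

# A raw Hugging Face BERT export still contains Shape nodes -> capture refused.
print(cuda_graph_viable(["MatMul", "Shape", "Gather", "Add"]))               # False
# A fused graph with the shape massaging folded away -> capture allowed.
print(cuda_graph_viable(["Attention", "SkipLayerNormalization", "MatMul"]))  # True
```

This is why disable_fallback() does not help here: the CPU placement happens during partitioning at session creation, before any fallback logic is involved.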
CC: @feihugis
Thank you @hariharans29 for your answer, it's very clear!