WIP: end-to-end ONNX export and inference for stable diffusion
A work-in-progress extension of the Stable Diffusion deployment through the ONNX + ONNX Runtime path, allowing the end-to-end pipeline to run through a single InferenceSession, with the exception of tokenization and timestep generation, which are done ahead of time.
From my preliminary tests, the ONNX export and inference through `CUDAExecutionProvider` work nicely.
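
For concreteness, here is a minimal sketch of what running such a single-session pipeline could look like. The model path and the input/output names (`input_ids`, `timesteps`, `latents`) are assumptions for illustration, not the actual interface of this PR:

```python
import numpy as np
import onnxruntime as ort
from transformers import CLIPTokenizer

# Tokenization happens ahead of time, outside the ONNX graph.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
prompt = "a photo of an astronaut riding a horse"
input_ids = tokenizer(
    prompt, padding="max_length", max_length=77, return_tensors="np"
).input_ids.astype(np.int64)

# The timestep schedule is also generated ahead of time and fed as a plain tensor.
timesteps = np.linspace(999, 0, num=50, dtype=np.int64)
latents = np.random.randn(1, 4, 64, 64).astype(np.float32)

# Hypothetical export path; the whole denoising loop runs inside this one session.
session = ort.InferenceSession(
    "stable_diffusion_end_to_end.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
(images,) = session.run(
    None, {"input_ids": input_ids, "timesteps": timesteps, "latents": latents}
)
```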

Major remaining issues:
- Inference through `CPUExecutionProvider` yields garbage (see the sanity-check sketch after this list).
- Memory usage for a single-batch inference with `CUDAExecutionProvider` is huge, up to 20 GB, although the model is < GB and PyTorch inference takes ~5 GB in the same setting.
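
A hedged sanity-check sketch for the two issues above: run the same exported graph on both providers and compare against a saved PyTorch reference. The file names and feed layout are assumptions:

```python
import numpy as np
import onnxruntime as ort

# Hypothetical files: the model inputs and a PyTorch reference output saved beforehand.
feeds = np.load("reference_feeds.npz")
reference = np.load("reference_output.npy")

for provider in ["CUDAExecutionProvider", "CPUExecutionProvider"]:
    session = ort.InferenceSession(
        "stable_diffusion_end_to_end.onnx", providers=[provider]
    )
    (output,) = session.run(None, {name: feeds[name] for name in feeds.files})
    max_diff = np.abs(output - reference).max()
    print(f"{provider}: max abs diff vs PyTorch = {max_diff:.5f}")
```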
From the two major issues above, it's clear that this POC is not usable as is. On top of fixing them, there would remain quite a bit of work:
- Separate more clearly the scheduler and pipeline implementations. Currently there's a hack that mixes both of them, which is very bad for generality (see the ahead-of-time scheduler sketch after this list).
- Support other pipelines than text2img (e.g. support img2img as well).
- Test on a larger variety of models; currently only Stable Diffusion 1.4 is tested.
- Test with `TensorrtExecutionProvider` ==> I expect to have issues there as well, as in my experience the Loop / If support in TensorRT can be buggy: https://github.com/onnx/onnx-tensorrt/issues/891. I'm not sure how interested NVIDIA folks are in `TensorrtExecutionProvider`, to be honest.
- Test in fp16 (if it's possible with ONNX Runtime and this kind of complex model).
- Support passing `width` and `height` as inputs, or alternatively, have them as constants that can be modified.
- Support passing `guidance_scale` as a model input.
- Test out with `num_inference_steps != 50`.
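
For the scheduler separation and the `num_inference_steps` point, the "ahead of time" step could look like the sketch below, using the PNDM scheduler that Stable Diffusion 1.4 ships with; how the resulting tensor is fed to the graph is an assumption:

```python
import numpy as np
from diffusers import PNDMScheduler

# Compute the timestep schedule outside the graph, then pass it as a plain int64 tensor.
scheduler = PNDMScheduler.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="scheduler"
)
scheduler.set_timesteps(num_inference_steps=30)  # anything other than 50 needs testing
timesteps = np.asarray(scheduler.timesteps, dtype=np.int64)
```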
Longer term goals once all of this is tested:
- Integration with `optimum.exporters`.
- Possibly, have an `ORTStableDiffusionEndToEndPipeline` or something like this that is PyTorch-free.
> Inference through `CPUExecutionProvider` yields garbage

This is due to a bug in FusedConv in ONNX Runtime, tracked in https://github.com/microsoft/onnxruntime/issues/14500.
> Memory usage for a single-batch inference with `CUDAExecutionProvider` is huge

Tracked here: https://github.com/microsoft/onnxruntime/issues/14526. ONNX Runtime seems to be very greedy compared to PyTorch when it comes to GPU memory. From what I tried, `CUDAExecutionProvider` is basically unusable for stable diffusion currently.
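
For what it's worth, the CUDA execution provider does expose a few memory-related options that could be worth trying while this is investigated; whether they actually help for this particular graph is untested here:

```python
import onnxruntime as ort

cuda_options = {
    "gpu_mem_limit": 8 * 1024**3,                 # cap the memory arena at 8 GB
    "arena_extend_strategy": "kSameAsRequested",  # grow the arena only as requested
    "cudnn_conv_algo_search": "HEURISTIC",        # avoid exhaustive conv algo workspaces
}
session = ort.InferenceSession(
    "stable_diffusion_end_to_end.onnx",  # hypothetical export path
    providers=[("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"],
)
```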
`torch.jit.trace` is pretty much unusable with deep loops: https://github.com/pytorch/pytorch/issues/93943. I'll just go on with `torch.jit.script`.
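
A toy illustration of the difference (not this PR's code): trace bakes in the number of loop iterations seen at tracing time, while script keeps the loop dynamic.

```python
import torch


def denoise_loop(x: torch.Tensor, timesteps: torch.Tensor) -> torch.Tensor:
    # A data-dependent loop, like the denoising loop over timesteps.
    for _ in range(timesteps.shape[0]):
        x = x * 0.9
    return x


traced = torch.jit.trace(denoise_loop, (torch.ones(1), torch.arange(3)))
scripted = torch.jit.script(denoise_loop)

print(traced(torch.ones(1), torch.arange(50)))    # still only 3 iterations baked in
print(scripted(torch.ones(1), torch.arange(50)))  # a real loop: 50 iterations
```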
When I run the code of this branch, I can generate the single .pt and .onnx files, but running the ONNX model multiple times with the same input gives different output results. Why is this? Are there any random parameters that need to be fixed?
python run_ort.py --gpu
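
One hedged guess, without having looked inside `run_ort.py`: if the script samples the initial latents (or any other noise) on every run, the outputs will differ; fixing the seeds, or reusing the exact same latents, is the first thing to check. A sketch:

```python
import numpy as np
import torch

# Fix the seeds of both RNGs that could be involved in sampling the initial latents.
np.random.seed(0)
torch.manual_seed(0)

# Or, more robustly, sample the latents once and reuse the exact same array each run.
latents = np.random.randn(1, 4, 64, 64).astype(np.float32)
np.save("fixed_latents.npy", latents)
```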
This PR has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.