WIP: end-to-end ONNX export and inference for stable diffusion
A work-in-progress extension of the Stable Diffusion deployment through the ONNX + ONNX Runtime path, allowing the end-to-end pipeline to run through a single InferenceSession, with the exception of tokenization and timestep generation, which are done ahead of time.
From my preliminary tests, the ONNX export and inference through `CUDAExecutionProvider` work nicely.
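
For concreteness, here is a minimal sketch of what running such a single-session pipeline could look like. The model path and the input/output names (`input_ids`, `timesteps`, `latents`) are assumptions for illustration, not the actual interface of this PR:

```python
import numpy as np
import onnxruntime as ort
from transformers import CLIPTokenizer

# Tokenization happens ahead of time, outside the ONNX graph.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
prompt = "a photo of an astronaut riding a horse"
input_ids = tokenizer(
    prompt, padding="max_length", max_length=77, return_tensors="np"
).input_ids.astype(np.int64)

# The timestep schedule is also generated ahead of time and fed as a plain tensor.
timesteps = np.linspace(999, 0, num=50, dtype=np.int64)
latents = np.random.randn(1, 4, 64, 64).astype(np.float32)

# Hypothetical export path; the whole denoising loop runs inside this one session.
session = ort.InferenceSession(
    "stable_diffusion_end_to_end.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
(images,) = session.run(
    None, {"input_ids": input_ids, "timesteps": timesteps, "latents": latents}
)
```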

Major remaining issues:
- Inference through `CPUExecutionProvider` yields garbage (see the sanity-check sketch after this list).
- Memory usage for a single-batch inference with `CUDAExecutionProvider` is huge, up to 20 GB, although the model is < GB and PyTorch inference takes ~5 GB in the same setting.
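
A hedged sanity-check sketch for the two issues above: run the same exported graph on both providers and compare against a saved PyTorch reference. The file names and feed layout are assumptions:

```python
import numpy as np
import onnxruntime as ort

# Hypothetical files: the model inputs and a PyTorch reference output saved beforehand.
feeds = np.load("reference_feeds.npz")
reference = np.load("reference_output.npy")

for provider in ["CUDAExecutionProvider", "CPUExecutionProvider"]:
    session = ort.InferenceSession(
        "stable_diffusion_end_to_end.onnx", providers=[provider]
    )
    (output,) = session.run(None, {name: feeds[name] for name in feeds.files})
    max_diff = np.abs(output - reference).max()
    print(f"{provider}: max abs diff vs PyTorch = {max_diff:.5f}")
```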
From the two major issues above, it's clear that this POC is not usable as is. On top of fixing them, there would remain quite a bit of work:
- Separate more clearly the scheduler and pipeline implementations. Currently there's a hack that mixes both of them, which is very bad for generality (see the ahead-of-time scheduler sketch after this list).
- Support other pipelines than text2img (e.g. support img2img as well).
- Test on a larger variety of models; currently only Stable Diffusion 1.4 is tested.
- Test with `TensorrtExecutionProvider` ==> I expect to have issues there as well, as in my experience the Loop / If support in TensorRT can be buggy: https://github.com/onnx/onnx-tensorrt/issues/891. I'm not sure how interested NVIDIA folks are in `TensorrtExecutionProvider`, to be honest.
- Test in fp16 (if it's possible with ONNX Runtime and this kind of complex model).
- Support passing `width` and `height` as inputs, or alternatively, have them as constants that can be modified.
- Support passing `guidance_scale` as a model input.
- Test out with `num_inference_steps != 50`.
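
For the scheduler separation and the `num_inference_steps` point, the "ahead of time" step could look like the sketch below, using the PNDM scheduler that Stable Diffusion 1.4 ships with; how the resulting tensor is fed to the graph is an assumption:

```python
import numpy as np
from diffusers import PNDMScheduler

# Compute the timestep schedule outside the graph, then pass it as a plain int64 tensor.
scheduler = PNDMScheduler.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="scheduler"
)
scheduler.set_timesteps(num_inference_steps=30)  # anything other than 50 needs testing
timesteps = np.asarray(scheduler.timesteps, dtype=np.int64)
```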
Longer term goals once all of this is tested:
- Integration with `optimum.exporters`.
- Possibly, have an `ORTStableDiffusionEndToEndPipeline` or something like this that is PyTorch-free.
> Inference through `CPUExecutionProvider` yields garbage

This is due to a bug in FusedConv in ONNX Runtime, tracked in https://github.com/microsoft/onnxruntime/issues/14500.
> Memory usage for a single-batch inference with `CUDAExecutionProvider` is huge

Tracked here: https://github.com/microsoft/onnxruntime/issues/14526. ONNX Runtime seems to be very greedy compared to PyTorch when it comes to GPU memory. From what I tried, `CUDAExecutionProvider` is basically unusable for stable diffusion currently.
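
For what it's worth, the CUDA execution provider does expose a few memory-related options that could be worth trying while this is investigated; whether they actually help for this particular graph is untested here:

```python
import onnxruntime as ort

cuda_options = {
    "gpu_mem_limit": 8 * 1024**3,                 # cap the memory arena at 8 GB
    "arena_extend_strategy": "kSameAsRequested",  # grow the arena only as requested
    "cudnn_conv_algo_search": "HEURISTIC",        # avoid exhaustive conv algo workspaces
}
session = ort.InferenceSession(
    "stable_diffusion_end_to_end.onnx",  # hypothetical export path
    providers=[("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"],
)
```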
`torch.jit.trace` is pretty much unusable with deep loops: https://github.com/pytorch/pytorch/issues/93943. I'll just go on with `torch.jit.script`.
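
A toy illustration of the difference (not this PR's code): trace bakes in the number of loop iterations seen at tracing time, while script keeps the loop dynamic.

```python
import torch


def denoise_loop(x: torch.Tensor, timesteps: torch.Tensor) -> torch.Tensor:
    # A data-dependent loop, like the denoising loop over timesteps.
    for _ in range(timesteps.shape[0]):
        x = x * 0.9
    return x


traced = torch.jit.trace(denoise_loop, (torch.ones(1), torch.arange(3)))
scripted = torch.jit.script(denoise_loop)

print(traced(torch.ones(1), torch.arange(50)))    # still only 3 iterations baked in
print(scripted(torch.ones(1), torch.arange(50)))  # a real loop: 50 iterations
```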
When I run the code of this branch, I can generate the single .pt and .onnx files, but running the ONNX model multiple times with the same input gives different output results. Why is this? Are there any random parameters that need to be fixed?
python run_ort.py --gpu
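
One hedged guess, without having looked inside `run_ort.py`: if the script samples the initial latents (or any other noise) on every run, the outputs will differ; fixing the seeds, or reusing the exact same latents, is the first thing to check. A sketch:

```python
import numpy as np
import torch

# Fix the seeds of both RNGs that could be involved in sampling the initial latents.
np.random.seed(0)
torch.manual_seed(0)

# Or, more robustly, sample the latents once and reuse the exact same array each run.
latents = np.random.randn(1, 4, 64, 64).astype(np.float32)
np.save("fixed_latents.npy", latents)
```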
This PR has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.