[Feature Request]: Potential mode for TensorRT inference over native PyTorch
Is there an existing issue for this?
- [X] I have searched the existing issues and checked the recent builds/commits
What would your feature do?
As shown in the SDA: Node repository, supported NVIDIA systems can achieve inference speeds of up to 4x over native PyTorch by utilising NVIDIA TensorRT.
Their demodiffusion.py file and text-to-image file (t2i.py) provide a good example of how this is used; it is mainly based off the NVIDIA diffusion demo folder.
The speedup comes from compiling the model into a highly optimised form that can be run on NVIDIA GPUs, e.g. https://huggingface.co/tensorrt/Anything-V3/tree/main
This splits the CLIP, UNET and VAE out into .plan files; these are serialized TensorRT engine files which contain the parameters of the optimized model. Running them through the TensorRT runtime imposes additional restrictions on resolution and batch size, as discussed in the SDA: Node README.
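For reference, loading one of those .plan files at runtime looks roughly like this minimal sketch using the TensorRT Python API (the path is a placeholder, not something this project defines):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(TRT_LOGGER)

def load_engine(plan_path: str):
    """Deserialize a prebuilt TensorRT engine (.plan) from disk."""
    with open(plan_path, "rb") as f:
        return runtime.deserialize_cuda_engine(f.read())

# e.g. the UNet engine built for one specific GPU / TensorRT version
unet_engine = load_engine("models/tensorrt/anything-v3/unet.plan")
context = unet_engine.create_execution_context()
```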
Usage is dependent on the following (https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html):
- An NVIDIA TensorRT-supported GPU
- CUDA (https://developer.nvidia.com/cuda-toolkit)
- v11.3+ is compatible with the latest cuDNN
- cuDNN (https://developer.nvidia.com/cudnn), *requires an NVIDIA developer account to download
- TensorRT (https://developer.nvidia.com/tensorrt), the zip for Windows, or the `tensorrt` package from the PyPI index for Linux
Examples of Implementations:
- https://github.com/NVIDIA/TensorRT/tree/main/demo/Diffusion
- https://github.com/chavinlo/sda-node
- https://github.com/VoltaML/voltaML-fast-stable-diffusion
- https://github.com/stochasticai/x-stable-diffusion
- https://github.com/ddPn08/Lsmith
I've managed to build sda-node for Linux and test TensorRT on Windows, and can confirm around a ~3x speedup on my own system compared to inference in AUTOMATIC1111.
Implementation would depend on loading a model's .plan files into a runtime with the TensorRT engine and plugins loaded, including synchronizing CUDA and PyTorch, etc.
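As a rough sketch of what that could look like, assuming a TensorRT 8.x-style API with PyTorch tensors used directly as device buffers on a shared CUDA stream (the binding layout and the assumption that the output matches the input shape are illustrative, not taken from any existing implementation):

```python
import tensorrt as trt
import torch

def run_unet(context: trt.IExecutionContext, latents: torch.Tensor) -> torch.Tensor:
    # Illustrative only: assumes one input and one output binding of the same shape.
    out = torch.empty_like(latents)
    stream = torch.cuda.current_stream().cuda_stream
    # The bindings list must follow the engine's binding order (inputs, then outputs).
    context.execute_async_v2(
        bindings=[int(latents.data_ptr()), int(out.data_ptr())],
        stream_handle=stream,
    )
    torch.cuda.current_stream().synchronize()  # keep TensorRT work in step with PyTorch
    return out
```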
I'm unsure if this would be in scope for this project. There is potentially too much that would not run on this runtime, meaning it might make more sense to just use a separate UI when wanting to utilise NVIDIA TensorRT models. People may find a number of features they'd expect to work no longer work on the TensorRT runtime. It's also another mode of operation that is hard to support and will need to be maintained. Just proposing it in a cleaner way before anyone else does; it could be that this is already in the works :-)
Proposed workflow
As I'm unsure of this project's structure and goals, I can't speak to the viability or the best implementation path.
- Possibly an inbuilt extension, TensorRT, enabled by a flag, say --tensorrt
- Place models (consisting of a model index file and .plan files) into a supported folder (could be inside /models/ or /inbuilt-extensions/tensorrt/models)
- Fail/flag up on an unsupported system
- Any dependencies that can be installed automatically are installed during the dependency check
- Any other dependencies that don't fit into this category are prompted with links to be manually installed/set up (see the sketch after this list)
- Features not supported by the TensorRT runtime: unsure, likely disabled?
- Text2Image as normal, but with a notable performance improvement.
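A minimal sketch of the unsupported-system / dependency check mentioned above, assuming hypothetical helper names that don't exist in the webui codebase:

```python
import importlib.util

import torch

def tensorrt_available() -> tuple[bool, str]:
    """Hypothetical check used to enable/disable the TensorRT code path."""
    if not torch.cuda.is_available():
        return False, "no CUDA-capable NVIDIA GPU detected"
    if importlib.util.find_spec("tensorrt") is None:
        return False, ("tensorrt python package not found; install the Windows zip or "
                       "`pip install tensorrt` on Linux (https://developer.nvidia.com/tensorrt)")
    return True, ""

ok, reason = tensorrt_available()
if not ok:
    print(f"TensorRT mode disabled: {reason}")
```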
Additional information
I'll likely be messing around a bit with this on my own, but I have limited knowledge of PyTorch/Stable Diffusion/CUDA, so I wouldn't expect any MR from myself.
Hello. May I ask how you got it to run on Windows? Were the OSS plugins required?
I'm curious if this would affect the actual generated image, since it says it optimizes CLIP, which can really change how a gen turns out depending on the value. Would still love to have it as an option though. 60 steps in under a second is crazy.
Note that a working version of inference with TensorRT on Windows is shown here: https://github.com/ddPn08/Lsmith
Another note is that these plan files are built specifically for your GPU and the libraries in use, so users need to give each model they want to use ~10 minutes to let it build and compile into a .plan file.
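For context, the build step that takes those ~10 minutes looks roughly like the following hedged sketch: compiling an exported ONNX UNet into a serialized engine with the TensorRT builder (paths are placeholders, and a real diffusion UNet with dynamic shapes would also need an optimization profile):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_plan(onnx_path: str, plan_path: str) -> None:
    """Compile an ONNX model into a hardware-specific serialized TensorRT engine."""
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError("failed to parse ONNX model")
    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)  # FP16 is typical for these diffusion engines
    serialized = builder.build_serialized_network(network, config)  # the slow part
    with open(plan_path, "wb") as f:
        f.write(serialized)

build_plan("unet.onnx", "unet.plan")
```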
I installed the Docker release of Lsmith and I can confirm it works well. It lacks most features AUTOMATIC1111's UI has, but BOY is it fast. If I knew how to help implement this, I would. If there are some tasks I could do to help, I'd get on them in my free time. When running batches on A1111 I got ~15 it/s; on Lsmith, running single image generations, I got 35 it/s. This would be a big improvement if it got implemented. For now I won't be using Lsmith because it's quite raw: there are resolution limits (1024x1024), no hires fix or face restore. https://i.imgur.com/5FDD7Yb.png
Haven't tried the Windows native install yet, but will as soon as I get time.
Still interested in this? I definitely am.
Yes, I am trying it and it mostly provides a 2x speedup on my laptop 3070.
@AUTOMATIC1111 Would it be possible to add this to the current code?
Has anyone compared TensorRT and Olive (DirectML)? They advertise similar performance.