[Feature Request]: Potential mode for TensorRT inference over native PyTorch
Is there an existing issue for this?
- [X] I have searched the existing issues and checked the recent builds/commits
What would your feature do?
As shown in the SDA: Node repository, supported NVIDIA systems can achieve inference speeds of up to 4x over native PyTorch by utilising NVIDIA TensorRT.
Their demodiffusion.py file and text-to-image file (t2i.py) provide a good example of how this is used; it is mainly based off the NVIDIA diffusion demo folder.
The speedup comes from compiling the model into a highly optimised form that can be run on NVIDIA GPUs, e.g. https://huggingface.co/tensorrt/Anything-V3/tree/main
This splits the CLIP, UNET and VAE out into .plan files; these are serialized TensorRT engine files which contain the parameters of the optimized model. Running them through the TensorRT runtime imposes additional restrictions on resolution and batch size, as discussed in the SDA: Node README.
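For reference, loading one of those .plan files at runtime looks roughly like this minimal sketch using the TensorRT Python API (the path is a placeholder, not something this project defines):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(TRT_LOGGER)

def load_engine(plan_path: str):
    """Deserialize a prebuilt TensorRT engine (.plan) from disk."""
    with open(plan_path, "rb") as f:
        return runtime.deserialize_cuda_engine(f.read())

# e.g. the UNet engine built for one specific GPU / TensorRT version
unet_engine = load_engine("models/tensorrt/anything-v3/unet.plan")
context = unet_engine.create_execution_context()
```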
Usage is dependent on the following (https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html):
- An NVIDIA TensorRT-supported GPU
- CUDA (https://developer.nvidia.com/cuda-toolkit)
- v11.3+ is compatible with the latest cuDNN
- cuDNN (https://developer.nvidia.com/cudnn), *requires an NVIDIA developer account to download
- TensorRT (https://developer.nvidia.com/tensorrt), the zip for Windows, or the `tensorrt` package from the PyPI index for Linux
Examples of Implementations:
- https://github.com/NVIDIA/TensorRT/tree/main/demo/Diffusion
- https://github.com/chavinlo/sda-node
- https://github.com/VoltaML/voltaML-fast-stable-diffusion
- https://github.com/stochasticai/x-stable-diffusion
- https://github.com/ddPn08/Lsmith
I've managed to build sda-node for Linux and test TensorRT on Windows, and can confirm around a ~3x speedup on my own system compared to inference in AUTOMATIC1111.
Implementation would depend on loading a model's .plan files into a runtime with the TensorRT engine and plugins loaded, including synchronizing CUDA and PyTorch, etc.
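As a rough sketch of what that could look like, assuming a TensorRT 8.x-style API with PyTorch tensors used directly as device buffers on a shared CUDA stream (the binding layout and the assumption that the output matches the input shape are illustrative, not taken from any existing implementation):

```python
import tensorrt as trt
import torch

def run_unet(context: trt.IExecutionContext, latents: torch.Tensor) -> torch.Tensor:
    # Illustrative only: assumes one input and one output binding of the same shape.
    out = torch.empty_like(latents)
    stream = torch.cuda.current_stream().cuda_stream
    # The bindings list must follow the engine's binding order (inputs, then outputs).
    context.execute_async_v2(
        bindings=[int(latents.data_ptr()), int(out.data_ptr())],
        stream_handle=stream,
    )
    torch.cuda.current_stream().synchronize()  # keep TensorRT work in step with PyTorch
    return out
```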
I'm unsure if this would be in scope for this project. There is potentially too much that would not run on this runtime, meaning it might make more sense to just use a separate UI when wanting to utilise NVIDIA TensorRT models. People may find a number of features they'd expect to work no longer work on the TensorRT runtime. It's also another mode of operation that is hard to support and will need to be maintained. Just proposing it in a cleaner way before anyone else does; it could be that this is already in the works :-)
Proposed workflow
As I'm unsure of this project's structure and goals, I can't speak to the viability or the best implementation path.
- Possibly an inbuilt extension, TensorRT, enabled by a flag, say --tensorrt
- Place models (consisting of a model index file and .plan files) into a supported folder (could be inside /models/ or /inbuilt-extensions/tensorrt/models)
- Fail/flag up on an unsupported system
- Any dependencies that can be installed automatically are installed during the dependency check
- Any other dependencies that don't fit into this category are prompted with links to be manually installed/set up (see the sketch after this list)
- Features not supported by the TensorRT runtime: unsure, likely disabled?
- Text2Image as normal, but with a notable performance improvement.
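A minimal sketch of the unsupported-system / dependency check mentioned above, assuming hypothetical helper names that don't exist in the webui codebase:

```python
import importlib.util

import torch

def tensorrt_available() -> tuple[bool, str]:
    """Hypothetical check used to enable/disable the TensorRT code path."""
    if not torch.cuda.is_available():
        return False, "no CUDA-capable NVIDIA GPU detected"
    if importlib.util.find_spec("tensorrt") is None:
        return False, ("tensorrt python package not found; install the Windows zip or "
                       "`pip install tensorrt` on Linux (https://developer.nvidia.com/tensorrt)")
    return True, ""

ok, reason = tensorrt_available()
if not ok:
    print(f"TensorRT mode disabled: {reason}")
```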
Additional information
I'll likely be messing around a bit with this on my own, but I have limited knowledge of PyTorch/Stable Diffusion/CUDA, so I wouldn't expect any MR from myself.
Hello. May I ask how you got it to run on Windows? Were the OSS plugins required?
I'm curious if this would affect the actual generated image, since it says it optimizes CLIP, which can really change how a gen turns out depending on the value. Would still love to have it as an option though. 60 steps in under a second is crazy.
Note that a working version of inference with TensorRT on Windows is shown here: https://github.com/ddPn08/Lsmith
Another note is that these plan files are built specifically for your GPU and the libraries in use, so users need to give each model they want to use ~10 minutes to let it build and compile into a .plan file.
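For context, the build step that takes those ~10 minutes looks roughly like the following hedged sketch: compiling an exported ONNX UNet into a serialized engine with the TensorRT builder (paths are placeholders, and a real diffusion UNet with dynamic shapes would also need an optimization profile):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_plan(onnx_path: str, plan_path: str) -> None:
    """Compile an ONNX model into a hardware-specific serialized TensorRT engine."""
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError("failed to parse ONNX model")
    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)  # FP16 is typical for these diffusion engines
    serialized = builder.build_serialized_network(network, config)  # the slow part
    with open(plan_path, "wb") as f:
        f.write(serialized)

build_plan("unet.onnx", "unet.plan")
```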
I installed the Docker release of Lsmith and I can confirm it works well. It lacks most features AUTOMATIC1111's UI has, but BOY is it fast. If I knew how to help implement this, I would. If there are some tasks I could do to help, I'd get on them in my free time. When running batches on A1111 I got ~15 it/s; on Lsmith, running single image generations, I got 35 it/s. This would be a big improvement if it got implemented. For now I won't be using Lsmith because it's quite raw: there are resolution limits (1024x1024), no hires fix or face restore. https://i.imgur.com/5FDD7Yb.png
Haven't tried the Windows native install yet, but will as soon as I get time.
Still interested in this? I definitely am.
Yes, I am trying it and it mostly provides a 2x speedup on my laptop 3070.
@AUTOMATIC1111 Would it be possible to add this to the current code?
Has anyone compared TensorRT and Olive (DirectML)? They advertise similar performance.