stable-diffusion-webui
LDM optimization patches
Description
Change 1: Timestep Embedding Patch
- Fixes a blocking op in the timestep embedding. It was creating a tensor on CPU and then moving it to GPU, which would force a sync every step.
- Combined with the other performance PRs (mine and HCL's), Torch's dispatch queue should be completely unblocked (until extensions with similar problems clog it up again). This should allow near-constant 100% GPU usage.
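The pattern can be sketched roughly as follows. This is a minimal sketch of LDM's sinusoidal embedding, not the exact patched code; the key change is passing `device=timesteps.device` to `torch.arange` so the tensor is created on the GPU directly instead of on CPU followed by a `.to(device)`:

```python
import math
import torch

def timestep_embedding(timesteps, dim, max_period=10000):
    """Sinusoidal timestep embedding built entirely on the timesteps' device."""
    half = dim // 2
    # Creating the frequency tensor with device=timesteps.device avoids the
    # CPU-tensor-then-.to(gpu) pattern, which forces a host/device sync.
    freqs = torch.exp(
        -math.log(max_period)
        * torch.arange(half, dtype=torch.float32, device=timesteps.device)
        / half
    )
    args = timesteps[:, None].float() * freqs[None]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
```

With the tensor born on the right device, no implicit transfer is queued, so the CPU can keep submitting kernels ahead of the GPU.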
Change 2: SpatialTransformer.forward einops removal
- Changes the function to use native torch reshape/view/permute ops and removes the `.contiguous()` call.
- Prevents 32 calls to `aten::copy_` and `void at::native::elementwise_kernel<128, 4, at::nati...` per forward pass (SD 1.5). The speedup appears to be around 6-8 ms per forward pass, though my profiler timings are a little inconsistent (512x512, batch 4, overclocked 3090).
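For illustration, the einops patterns used in this kind of forward pass (`'b c h w -> b (h w) c'` and its inverse) map onto native ops as below. The shapes are hypothetical and this only demonstrates the equivalence, not the actual patch:

```python
import torch

b, c, h, w = 2, 4, 8, 8
x = torch.randn(b, c, h, w)

# einops.rearrange(x, 'b c h w -> b (h w) c') as native ops:
tokens = x.reshape(b, c, h * w).permute(0, 2, 1)

# ... attention would run on `tokens` here ...

# einops.rearrange(tokens, 'b (h w) c -> b c h w', h=h, w=w) as native ops.
# reshape() handles the non-contiguous permuted tensor without an explicit
# .contiguous() call, copying only when the strides make it unavoidable.
back = tokens.permute(0, 2, 1).reshape(b, c, h, w)

assert torch.equal(back, x)
```

Dropping the explicit `.contiguous()` is what eliminates the extra `aten::copy_` launches; `reshape` returns a view whenever the strides allow it.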
Checklist:
- [x] I have read the contributing wiki page
- [x] I have performed a self-review of my own code
- [x] My code follows the style guidelines
- [x] My code passes tests
I think #18620 might need to be merged before tests will pass on this.
- we are currently on #15824, so we need to wait for 2,796 more issues/PRs before this can merge 🙃
Upon further review I think it would be sufficient for #15820 to be merged first lol
Added another patch, and it passes tests now.