stable-diffusion-webui
LDM optimization patches
Description
Change 1: Timestep Embedding Patch
- Fixes a blocking op in the timestep embedding. It was creating a tensor on CPU and then moving it to GPU, which would force a sync every step.
- Combined with the other performance PRs (mine and HCL's), Torch's dispatch queue should be completely unblocked (until extensions with similar problems clog it up again). This should allow near-constant 100% GPU usage.
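The pattern can be sketched roughly as follows. This is a minimal sketch of LDM's sinusoidal embedding, not the exact patched code; the key change is passing `device=timesteps.device` to `torch.arange` so the tensor is created on the GPU directly instead of on CPU followed by a `.to(device)`:

```python
import math
import torch

def timestep_embedding(timesteps, dim, max_period=10000):
    """Sinusoidal timestep embedding built entirely on the timesteps' device."""
    half = dim // 2
    # Creating the frequency tensor with device=timesteps.device avoids the
    # CPU-tensor-then-.to(gpu) pattern, which forces a host/device sync.
    freqs = torch.exp(
        -math.log(max_period)
        * torch.arange(half, dtype=torch.float32, device=timesteps.device)
        / half
    )
    args = timesteps[:, None].float() * freqs[None]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
```

With the tensor born on the right device, no implicit transfer is queued, so the CPU can keep submitting kernels ahead of the GPU.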
Change 2: SpatialTransformer.forward einops removal
- Changes the function to use native torch reshape/view/permute ops and removes the `.contiguous()` call.
- Prevents 32 calls to `aten::copy_` and `void at::native::elementwise_kernel<128, 4, at::nati...` per forward pass (SD 1.5). The speedup appears to be around 6-8 ms per forward pass, though my profiler timings are a little inconsistent (512x512, batch 4, overclocked 3090).
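For illustration, the einops patterns used in this kind of forward pass (`'b c h w -> b (h w) c'` and its inverse) map onto native ops as below. The shapes are hypothetical and this only demonstrates the equivalence, not the actual patch:

```python
import torch

b, c, h, w = 2, 4, 8, 8
x = torch.randn(b, c, h, w)

# einops.rearrange(x, 'b c h w -> b (h w) c') as native ops:
tokens = x.reshape(b, c, h * w).permute(0, 2, 1)

# ... attention would run on `tokens` here ...

# einops.rearrange(tokens, 'b (h w) c -> b c h w', h=h, w=w) as native ops.
# reshape() handles the non-contiguous permuted tensor without an explicit
# .contiguous() call, copying only when the strides make it unavoidable.
back = tokens.permute(0, 2, 1).reshape(b, c, h, w)

assert torch.equal(back, x)
```

Dropping the explicit `.contiguous()` is what eliminates the extra `aten::copy_` launches; `reshape` returns a view whenever the strides allow it.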
Checklist:
- [x] I have read the contributing wiki page
- [x] I have performed a self-review of my own code
- [x] My code follows the style guidelines
- [x] My code passes tests
I think #18620 might need to be merged before tests will pass on this.
- we are currently on #15824, so we need to wait for 2,796 more issues/PRs before this can merge 🙃
Upon further review I think it would be sufficient for #15820 to be merged first lol
Added another patch, and it passes tests now.