tomesd
Cannot reproduce the results of speedup
I wrote a script to test the speedup myself and ran it on an RTX 3090 GPU, but I could not reproduce the speedup reported in Table 4 of the paper.
```python
import os, random, time

import numpy as np
import torch
import tomesd
from diffusers import StableDiffusionPipeline, DDIMScheduler, PNDMScheduler


def seed_everything(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    random.seed(seed)
    np.random.seed(seed)


seed = 2024
seed_everything(seed)
batch_size = 1
num_inference_steps = 50
os.makedirs("results", exist_ok=True)

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
# pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe.scheduler = PNDMScheduler.from_config(pipe.scheduler.config)

prompt = "a photo of an astronaut riding a horse on mars"
generator = torch.Generator().manual_seed(seed)

# Warm up the GPU so the timed runs are not skewed by initialization cost
print("Warming up the GPU")
for i in range(2):
    pipe([prompt] * batch_size, num_inference_steps=num_inference_steps)

# ------------------- original pipeline -------------------
start_time = time.time()
image = pipe([prompt] * batch_size, num_inference_steps=num_inference_steps,
             generator=generator).images[0]
end_time = time.time()
print("Origin Pipeline: {:.3f} seconds".format(end_time - start_time))
image.save("results/origin.png")

# ------------------- ToMe with a 50% merging ratio -------------------
generator = torch.Generator().manual_seed(seed)
tomesd.apply_patch(pipe, ratio=0.5)  # can also pass pipe.unet instead of pipe
start_time = time.time()
image = pipe([prompt] * batch_size, num_inference_steps=num_inference_steps,
             generator=generator).images[0]
end_time = time.time()
print("ToMe: {:.3f} seconds".format(end_time - start_time))
image.save("results/origin_ToMe.png")
```
The output from this script is:
```
Origin Pipeline: 2.728 seconds
ToMe: 2.581 seconds
```
In Table 4, the speedup is nearly 2x at r=50%, yet my test shows almost none. Why is that? Could you give me some advice?
- You might not be giving your GPU enough work. Try increasing the batch size and image size and see what happens.
- The results in the paper were obtained with the original Stable Diffusion repo, not diffusers. diffusers has implemented a number of optimizations of its own that eat into ToMe's speed-up, but you should still see gains at larger image sizes (see the sketch below).
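A minimal sketch of that suggestion, using the same model and prompt as the script above; the 768x768 resolution and batch size of 4 are arbitrary choices to load the GPU, not values from the paper:

```python
import time
import torch, tomesd
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
prompt = "a photo of an astronaut riding a horse on mars"

def bench(label, batch_size=4, size=768):
    # Synchronize so the wall-clock window covers only this pipeline call
    torch.cuda.synchronize()
    start = time.time()
    pipe([prompt] * batch_size, height=size, width=size,
         num_inference_steps=50)
    torch.cuda.synchronize()
    print(f"{label}: {time.time() - start:.3f} seconds")

bench("warmup")                      # first call pays allocation/setup costs
bench("baseline")                    # no patch applied yet
tomesd.apply_patch(pipe, ratio=0.5)  # same 50% merging ratio as above
bench("ToMe")
```

The intuition is that at 512x512 with batch size 1, fixed overheads (kernel launches, the text encoder, the VAE) take up a large share of the runtime; scaling up the token count makes the attention cost, which is what ToMe reduces, the dominant term.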
Does this still work in 2024, with the SDXL or Lightning models?
Tested on SDXL, and I accidentally found that merging the middle blocks (1024 dimensions) gives much better results (although no big speedup).
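For anyone who wants to try the same thing, here is a minimal sketch. It assumes tomesd's `max_downsample` option is the right knob for reaching those deeper blocks (by default only the highest-resolution blocks are patched); the SDXL checkpoint name and the value of 8 are my assumptions, not something stated in this thread:

```python
import torch, tomesd
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# max_downsample=8 patches transformer blocks at every downsampling level,
# not just the highest-resolution ones (the default, max_downsample=1,
# matches the configuration used in the paper).
tomesd.apply_patch(pipe, ratio=0.5, max_downsample=8)

image = pipe("a photo of an astronaut riding a horse on mars",
             num_inference_steps=30).images[0]
image.save("xl_tome.png")
```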