Initial Delay in Image Generation with Flux Schnell on H100

Open uayodev opened this issue 1 year ago • 6 comments

Thank you very much for your incredible work @aredden!

I wanted to ask you about something I've noticed when using Flux Schnell on an H100 with compile_extras and compile_blocks enabled. After running the three warmups of Flux Schnell, the first image I generate takes about 45 seconds to start the first iteration, but subsequent images generate quickly. Is this normal? Is there any way to avoid this initial delay?

I appreciate your help in advance.

uayodev • Oct 08 '24 22:10

The slowdown is due to torch.compile compilation. It should speed up after that, but the initial generation may take a while, and so may the first generation for each newly requested image shape. The initial slowdown is much more reasonable with torch nightly, or just torch > 2.4.x, since I believe they made compilation quite a bit faster, or at least it is faster on my machine. I barely notice compilation time anymore, though I do have a beefy computer.

aredden • Oct 09 '24 14:10
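
One way to act on the reply above is to pay the compilation cost for every image shape you plan to serve at startup, rather than on a user's first request. The sketch below is not taken from flux-fp8-api: `warm_up_shapes` and `fake_generate` are hypothetical names, and `fake_generate` only simulates a one-time per-shape cost with a short sleep, standing in for whatever call produces an image in your setup.

```python
import time
from typing import Callable, Dict, Iterable, Tuple

Shape = Tuple[int, int]  # (width, height)


def warm_up_shapes(
    generate: Callable[[int, int], object],
    shapes: Iterable[Shape],
) -> Dict[Shape, float]:
    """Call `generate` once per shape and return how long each call took (seconds)."""
    timings: Dict[Shape, float] = {}
    for width, height in shapes:
        start = time.perf_counter()
        generate(width, height)
        timings[(width, height)] = time.perf_counter() - start
    return timings


# Hypothetical stand-in for a compiled pipeline: the first call for a given
# shape is slow (simulated compilation), later calls for that shape are fast.
_seen_shapes = set()


def fake_generate(width: int, height: int) -> str:
    if (width, height) not in _seen_shapes:
        time.sleep(0.05)  # simulated one-time compile cost for this shape
        _seen_shapes.add((width, height))
    return f"image {width}x{height}"


shapes_to_serve = [(1024, 1024), (768, 1344)]
first_pass = warm_up_shapes(fake_generate, shapes_to_serve)   # pays the one-time cost
second_pass = warm_up_shapes(fake_generate, shapes_to_serve)  # already warmed up

for shape in shapes_to_serve:
    print(shape, f"first={first_pass[shape]:.3f}s", f"second={second_pass[shape]:.3f}s")
```

In a real deployment, `generate` would be replaced by your own image-generation call, and the shape list by the resolutions you actually expect to receive. Any shape left out of the list would still trigger the delay on its first request.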

Thanks so much for your reply! I really appreciate it.

uayodev • Oct 11 '24 09:10

I have tested this slowdown on an H100 and an RTX 4090. The slowdown is around 1 minute with regular (non-nightly) torch, and around 3-7 seconds with torch nightly.

Muawizodux • Oct 17 '24 06:10

Yeah, so essentially using nightly is significantly better.

aredden • Oct 17 '24 23:10

I'm still experiencing a slowdown with the initial compilation on an H100 with a torch nightly build (2.6.0.dev20240918+cu124). Based on the previous comments here, that should not happen, right? Any thoughts on why this might happen?

lenvoMaster • Nov 29 '24 11:11

I think it depends. Sometimes compilation will be more costly than others depending on the torch version. I think at the time, nightly was 2.5.0 or 2.5.1, I'm not sure. So it could be that you need one of those two versions for the fastest compile time.

aredden • Dec 03 '24 15:12
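
Since compile time appears to vary by torch version, it can help to confirm which build is actually installed before comparing timings. A minimal sketch, assuming nightly builds carry a `.dev<date>` segment as in the version string quoted earlier in this thread; `parse_torch_version` and `installed_torch_version` are hypothetical helpers, not part of torch or this repository.

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def parse_torch_version(version: str) -> Tuple[Tuple[int, ...], bool]:
    """Split a version string into its (major, minor, patch) release and a nightly flag."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    release = tuple(int(part) for part in match.groups())
    # Assumption: nightly builds are marked with a ".dev<date>" segment.
    is_nightly = ".dev" in version
    return release, is_nightly


def installed_torch_version() -> Optional[str]:
    """Return the installed torch version string, or None if torch is not installed."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


version = installed_torch_version()
if version is None:
    print("torch is not installed in this environment")
else:
    release, is_nightly = parse_torch_version(version)
    newer_than_2_4 = release[:2] > (2, 4)
    print(f"torch {version}: nightly={is_nightly}, newer than 2.4.x={newer_than_2_4}")
```

Recording this alongside the measured first-image delay makes reports like the ones in this thread easier to compare across machines.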