Initial Delay in Image Generation with Flux Schnell on H100
Thank you very much for your incredible work @aredden!
I wanted to ask about something I've noticed when using Flux Schnell on an H100 with compile_extras and compile_blocks enabled. After running the three warmups of Flux Schnell, the first image I generate takes about 45 seconds to start the first iteration, but subsequent images generate quickly. Is this normal? Is there any way to avoid this initial delay?
I appreciate your help in advance.
The slowdown comes from torch.compile compilation. It should speed up after that, but the initial generation may take a while, and so may the first generation at each newly requested image shape. The initial slowdown is much more reasonable with torch nightly, or any torch newer than 2.4.x, since I believe they made compilation quite a bit faster, or at least it is faster on my machine. I barely notice compilation time anymore, though I do have a beefy computer.
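Since each new image shape can trigger its own compilation, one way to keep that cost out of the first real request is to run one throwaway generation per shape you expect to serve. A minimal sketch, assuming a `generate(width, height)` callable that wraps your pipeline call; the `warm_up` helper and the `pipe.generate(...)` usage in the comment are hypothetical names for illustration, not part of the repo's documented API:

```python
import time
from typing import Callable, Dict, Iterable, Tuple

Shape = Tuple[int, int]  # (width, height)


def warm_up(
    generate: Callable[[int, int], object],
    shapes: Iterable[Shape],
) -> Dict[Shape, float]:
    """Call `generate` once per shape and return the seconds each call took.

    With a torch.compile'd model, the first call at a given shape is where
    compilation is expected to happen, so these timings include that cost.
    """
    timings: Dict[Shape, float] = {}
    for width, height in shapes:
        start = time.perf_counter()
        generate(width, height)  # output is discarded; only the side effect matters
        timings[(width, height)] = time.perf_counter() - start
    return timings


# Hypothetical usage (assumes `pipe` is an already loaded pipeline object):
#
#   timings = warm_up(
#       lambda w, h: pipe.generate(prompt="warmup", width=w, height=h, num_steps=4),
#       [(1024, 1024), (768, 1344)],
#   )
#   for shape, seconds in timings.items():
#       print(shape, f"{seconds:.1f}s")
```

This does not remove the compilation cost, it only moves it to startup, and any shape not in the list may still be slow on its first request.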
Thanks so much for your reply! I really appreciate it.
I have tested this slowdown on an H100 and an RTX 4090. The slowdown is around 1 minute with regular (non-nightly) torch, and around 3-7 seconds with torch nightly.
Yeah, so essentially using nightly is significantly better.
I'm still experiencing a slowdown during the initial compilation on an H100 with a torch nightly build (2.6.0.dev20240918+cu124). Based on the previous comments here, that should not happen, right? Any thoughts on why this might be happening?
I think it depends. Compilation can be more or less costly depending on the torch version. I think nightly at the time was 2.5.0 or 2.5.1, I'm not sure, so it could be that you need one of those two for the fastest compile time.
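For anyone comparing numbers across setups, it helps to report the exact torch release alongside the compile time. A rough sketch of pulling the major and minor numbers out of a version string such as the nightly one quoted above; `release_tuple` is a hypothetical helper name, and in practice the string would come from `torch.__version__`:

```python
import re
from typing import Tuple


def release_tuple(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a version string like '2.6.0.dev20240918+cu124'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


# Version string taken from the comment above; '2.4.1+cu121' is only an
# illustrative example of a 2.4.x build.
print(release_tuple("2.6.0.dev20240918+cu124") > (2, 4))
print(release_tuple("2.4.1+cu121") > (2, 4))
```

This only tells you whether a build is newer than 2.4.x. As noted above, a newer build is not guaranteed to compile faster than 2.5.0 or 2.5.1.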