
Inference on TPUs instead of GPUs.

Open kennycoder opened this issue 1 year ago • 1 comments

Hi folks! Our AI Hypercomputer team ported the Flux inference implementation to MaxDiffusion and was able to successfully run both the Flux-dev and Flux-schnell models on Google's TPUs.

Running tests on 1024 x 1024 images with flash attention and bfloat16 gave the following results:

| Model | Accelerator | Sharding Strategy | Batch Size | Steps | Time (s) |
|---|---|---|---|---|---|
| Flux-dev | v4-8 | DDP | 4 | 28 | 23 |
| Flux-schnell | v4-8 | DDP | 4 | 4 | 2.2 |
| Flux-dev | v6e-4 | DDP | 4 | 28 | 5.5 |
| Flux-schnell | v6e-4 | DDP | 4 | 4 | 0.8 |
| Flux-schnell | v6e-4 | FSDP | 4 | 4 | 1.2 |

We'd appreciate it if you could give us feedback on these results and on our overall approach.
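For readers comparing configurations, the per-image latency and throughput implied by the table can be computed directly from the reported batch sizes and batch times. This is a minimal sketch using only the numbers in the table above:

```python
# Per-image latency and throughput implied by the benchmark table.
# Each entry is (model, accelerator, sharding, batch_size, steps, batch_time_s),
# copied from the table above; batch_time_s is seconds for the whole batch.
results = [
    ("Flux-dev",     "v4-8",  "DDP",  4, 28, 23.0),
    ("Flux-schnell", "v4-8",  "DDP",  4,  4,  2.2),
    ("Flux-dev",     "v6e-4", "DDP",  4, 28,  5.5),
    ("Flux-schnell", "v6e-4", "DDP",  4,  4,  0.8),
    ("Flux-schnell", "v6e-4", "FSDP", 4,  4,  1.2),
]

for model, accel, shard, batch, steps, t in results:
    per_image = t / batch    # seconds per image
    throughput = batch / t   # images per second
    print(f"{model} on {accel} ({shard}): "
          f"{per_image:.3f} s/image, {throughput:.2f} img/s")
```

For example, Flux-schnell on v6e-4 with DDP works out to 0.2 s per image, or 5 images per second.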

kennycoder avatar Feb 14 '25 15:02 kennycoder


Hello, as a beginner, this has been very informative. I don’t have any prior experience with PyTorch, Diffusers, or similar frameworks, and I couldn’t find any clear documentation on how to run open-source image generation models like Flux Dev on TPUs.

On Google Cloud Platform, a single H100 (spot instance) costs around $1,800 per month, while a v6e-4 TPU instance (I assume this means 4 TPU chips) costs about $1,900 per month.

I’m currently trying to learn how to build my own image generation infrastructure, which I find a very interesting area. I’d like to hear your thoughts on what the best instance configuration would be for running image generation workloads in a setup like this.

Do you think a single-H100 GPU instance (around $1,800/month) would be a better choice than a v6e TPU instance, in terms of performance and practicality for this kind of architecture?

Thank you.

muzakon avatar Oct 31 '25 16:10 muzakon