Alex Redden

Results: 82 comments by Alex Redden

1. You can optionally quantize the others by setting `"quantize_flow_embedder_layers": true`, but doing so reduces quality pretty considerably and doesn't save much vram or increase it/s. The non-single-or-double-block layers...
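
A minimal sketch of toggling that flag, assuming the repo's JSON-config layout; the `configs/config-dev.json` path is an assumption, only `quantize_flow_embedder_layers` is taken from the comment above:

```python
import json

# Hypothetical config path; check the actual filename in your flux-fp8-api checkout.
config_path = "configs/config-dev.json"

with open(config_path) as f:
    config = json.load(f)

# Key quoted in the comment above: quantizes the non-single-or-double-block layers.
# Per the comment, expect a noticeable quality drop for little VRAM or it/s benefit.
config["quantize_flow_embedder_layers"] = False

with open(config_path, "w") as f:
    json.dump(config, f, indent=4)
```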

Not really- it has enough SRAM that it gets the same TFLOPS for fp16 w/ fp32 accumulate as it does for fp16 w/ fp16 accumulate. @spejamas
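
If you want to compare the two accumulation modes on your own card, PyTorch exposes a backend switch for it; this is just a generic sketch, not anything from the repo:

```python
import torch

# False forces fp32 accumulation for fp16 matmuls; True (the default) allows
# fp16 accumulation. Whether the two settings reach the same TFLOPS is a
# property of the GPU, which is what the comment above is getting at.
torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
c = a @ b  # fp16 inputs, fp32 accumulation inside the kernel
```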

I'm sure it would work with tinkering, not sure how different the architecture is, but if it's similar to normal flux then it will work. If it's more of a...

Ah that's interesting. Are you using the latest code? There was a bug earlier where it was always setting the lora alpha to 1.0 for huggingface diffusers loras. Though it...
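
For context, a sketch of why a forced alpha of 1.0 matters; this is a hypothetical helper, not the repo's `lora_loading.py`:

```python
import torch

def merge_lora(weight: torch.Tensor, lora_A: torch.Tensor, lora_B: torch.Tensor,
               alpha: float, rank: int) -> torch.Tensor:
    # Standard LoRA merge: W' = W + (alpha / rank) * (B @ A).
    # If alpha is silently forced to 1.0, the delta gets scaled by 1/rank
    # instead of alpha/rank, so the lora's effect comes out much weaker.
    return weight + (alpha / rank) * (lora_B @ lora_A)
```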

It's possible that there are some lora loading specifics that I didn't implement well- but I'm not entirely sure what that would be. I will have to look into other...

If you're getting black images I would recommend setting flow_dtype to bfloat16; it should help a bit. I'm still a bit unsure as to how I am supposed to handle...
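
The bfloat16 suggestion comes down to dynamic range; a tiny illustration (the value here is made up just to show the overflow):

```python
import torch

x = torch.full((3,), 70000.0)    # activation magnitude past float16's ~65504 max
print(x.to(torch.float16))       # overflows to inf -> NaNs downstream -> black image
print(x.to(torch.bfloat16))      # stays finite; bf16 trades precision for fp32's range
```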

Thanks 😄 - well, if you find an issue anywhere in my lora loading implementation here https://github.com/aredden/flux-fp8-api/blob/main/lora_loading.py let me know and I'll change it, or you can submit a pull request and...

Seems like most of the delay is actually synchronization, which sort of implies that the slowdown is coming from something else earlier in the code. The way torch works is each...
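
For anyone profiling this: CUDA kernels are queued asynchronously, so a host-side timer tends to blame whatever op happens to block first rather than the op that is actually slow. CUDA events plus an explicit sync give a truer picture (generic sketch, not code from the repo):

```python
import torch

x = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
y = x @ x                    # queued on the GPU; the Python call returns immediately
end.record()
torch.cuda.synchronize()     # block until all queued work is done before reading timers
print(f"matmul: {start.elapsed_time(end):.2f} ms")
```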

Actually, it could be because you may have set autoencoder offloading to true- in that case the slowdown could be moving the vae to gpu, encoding,...
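
Roughly what an offloaded-autoencoder step looks like (hypothetical helper, not the repo's actual implementation); the two weight transfers, not the autoencoder itself, are often what dominates the per-image time:

```python
import torch

def decode_with_offload(vae, latents: torch.Tensor) -> torch.Tensor:
    # Move the autoencoder's weights to the GPU, run it, then move them back to CPU.
    vae.to("cuda")
    with torch.no_grad():
        images = vae.decode(latents.to("cuda"))
    vae.to("cpu")
    torch.cuda.empty_cache()
    return images
```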

I'm not entirely sure what the slowdown would be- though an L4 has pretty low wattage limits, so it might be throttling because of that. I would...
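
One quick way to check for power throttling (assumes the `nvidia-ml-py`/`pynvml` package; GPU index 0 is an assumption):

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Both values are reported in milliwatts.
draw_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000
limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000
print(f"drawing {draw_w:.0f} W of a {limit_w:.0f} W enforced limit")

pynvml.nvmlShutdown()
```

If the draw sits pinned at the limit during generation, clocks (and it/s) will drop accordingly.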