Bagheera

446 comments by Bagheera

Something still isn't quite right on MPS; maybe the lack of 8-bit optimisers hurts more than I'd think, haha. We see sampling speed improvements up to bsz=8 and then...

![image](https://github.com/huggingface/diffusers/assets/59658056/b978e62e-8065-47ce-af0b-dac2f412542d)

Tested the above training implementation on (so far) 300 steps of `ptx0/photo-concept-bucket` at a decent learning rate and a batch size of 4 on an Apple M3 Max. It's definitely learning....

Unfortunately I hit a NaN at the 628th step of training, approximately the same place as before

Looks like it could be https://github.com/pytorch/pytorch/issues/118115, as both of the optimizers that fail in this way use `addcdiv`
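For context, `addcdiv` is the fused element-wise `input + value * (tensor1 / tensor2)` that Adam-family optimizers use for their parameter update, so a broken backend kernel there corrupts every parameter it touches at once. A rough pure-Python sketch of what the op computes (illustrative helper, not the torch kernel; the MPS bug is in the backend implementation, not this arithmetic):

```python
import math

def addcdiv(inp, tensor1, tensor2, value=1.0):
    """Pure-Python sketch of torch.addcdiv:
    inp + value * (tensor1 / tensor2), element-wise."""
    return [x + value * (a / b) for x, a, b in zip(inp, tensor1, tensor2)]

# Adam-style update: p <- p - lr * exp_avg / (sqrt(exp_avg_sq) + eps)
lr, eps = 1e-2, 1e-8
params = [1.0, 2.0]
exp_avg = [0.1, -0.2]
denom = [math.sqrt(0.04) + eps, math.sqrt(0.09) + eps]
params = addcdiv(params, exp_avg, denom, value=-lr)
print(params)
```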

It crashed after 628 steps, and then on resume it crashed after another 300 steps, on the 901st. It also seems to get a lot slower than it should sometimes -...

@sayakpaul you know what, it ended up being a cached latent with NaN values. I ran the SDXL VAE in fp16 mode since I was using PyTorch 2.2 a...

Using the madebyollin SDXL VAE fp16 model, it occasionally NaNs, but not often enough to find the issue right away
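A cheap way to catch this class of bug is to validate latents as they're cached rather than letting a bad one surface hundreds of steps into training. A minimal sketch, using a plain-Python finiteness check as a stand-in for `torch.isfinite(latents).all()` on real tensors (names here are illustrative):

```python
import math

def latents_are_finite(values):
    """Return False if any cached value is NaN or Inf.

    Stand-in for torch.isfinite(latents).all().item() on a real tensor;
    `values` is a flat list of floats for illustration.
    """
    return all(math.isfinite(v) for v in values)

# fp16 tops out around 65504, so an overflowing VAE activation becomes
# inf, and downstream arithmetic turns it into NaN in the cached latent.
good_latent = [0.12, -1.5, 3.25]
bad_latent = [0.12, float("nan"), 3.25]
print(latents_are_finite(good_latent))  # True
print(latents_are_finite(bad_latent))   # False
```

Rejecting (or re-encoding) a latent that fails this check at cache time keeps one bad sample from poisoning the optimizer state later.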

On a new platform, the workarounds that are required for all platforms might not be in place yet. E.g. CUDA handles type casting automatically, but MPS requires strict types - any...
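The usual workaround is to cast operands to a common dtype explicitly instead of relying on the backend to do it. A sketch under that assumption (`safe_add` is a hypothetical helper, not a diffusers or torch function):

```python
import torch

def safe_add(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Promote both operands to a common dtype before the op.

    CUDA kernels will often upcast mixed-precision inputs for you;
    MPS has historically rejected some mixed-dtype ops, so casting
    explicitly keeps one code path working on both backends.
    """
    dtype = torch.promote_types(a.dtype, b.dtype)
    return a.to(dtype) + b.to(dtype)

a = torch.ones(2, dtype=torch.float16)
b = torch.ones(2, dtype=torch.float32)
print(safe_add(a, b).dtype)  # torch.float32
```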

fp16 inference was thrown out long ago:

* SDXL's VAE doesn't work with it
* SD 2.1's UNet doesn't work with it