Bagheera
try `grep oom /proc/vmstat`, which will show how many OOM-killer events have occurred without needing dmesg access
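for reference, a minimal sketch of that check — on kernels 4.13+ `/proc/vmstat` exposes an `oom_kill` counter that any unprivileged user can read:

```shell
# count OOM-killer invocations since boot, no root/dmesg needed
# (the oom_kill counter exists on kernels >= 4.13)
grep oom /proc/vmstat
```

a nonzero `oom_kill` value means the kernel has killed processes under memory pressure, which is usually the smoking gun for silently disappearing training jobs.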
system RAM yes
added a note about this to the flux quickstart
Resolves #656
i'm not sure i really have the bandwidth to look into this one; someone else with the equipment and ability to reproduce the issue might have to take a look...
no no, i really dislike conda :D
this was confirmed to be working on the main branch now. the train.sh script is updated to locate and rely on the nvidia libraries in the venv.
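a rough sketch of what that kind of discovery can look like (the variable names here are illustrative, not the actual train.sh contents) — find the pip-installed nvidia library dirs inside the venv and prepend them to the loader path:

```shell
# hypothetical sketch: locate the nvidia-* pip packages' lib dirs inside
# the active venv and put them on LD_LIBRARY_PATH so torch resolves its
# CUDA libraries from the venv rather than a system-wide install
NVIDIA_LIB_DIRS=$(find "${VIRTUAL_ENV:-.venv}" -type d -path '*nvidia*/lib' 2>/dev/null | tr '\n' ':')
export LD_LIBRARY_PATH="${NVIDIA_LIB_DIRS}${LD_LIBRARY_PATH}"
```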
actually, using `aot_eager` gets autograd involved, and then the dtype complaints happen. the gradients need to be in fp32 precision ... for a low-bit optim? 🤔
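a minimal repro-style sketch of what "gets autograd involved" means, assuming torch >= 2.0 — the `aot_eager` backend routes the model through AOTAutograd, so the backward pass is traced too, which is where mixed-dtype gradient checks start firing:

```python
import torch

# compiling with backend="aot_eager" traces both forward and backward
# through AOTAutograd; the backward here produces plain fp32 gradients
model = torch.nn.Linear(4, 4)
compiled = torch.compile(model, backend="aot_eager")

out = compiled(torch.randn(2, 4))
out.sum().backward()  # autograd is now in the loop
```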
yeah, simpletuner supports finetuning diffusion models via torch-mps, with or without optimum-quanto, up to the 12B-parameter Flux model, which really takes advantage of quantisation: down from 30G at pure...
either way, i'm not seeing memory savings with the 8-bit adamw, as i need the gradients to be upcast to fp32. the 4-bit optim uses some ops not implemented on MPS...
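a small sketch of why that upcast eats the savings — if the low-bit optimiser wants fp32 gradients, each bf16 grad gets a full-precision copy before the step, so the quantised optimiser state is offset by the grad upcast (names here are illustrative, not the actual optimiser internals):

```python
import torch

# bf16 parameter -> bf16 gradient from backward
param = torch.nn.Parameter(torch.randn(8, 8, dtype=torch.bfloat16))
loss = (param.float() ** 2).sum()
loss.backward()

# the upcast the low-bit optimiser demands: a transient fp32 copy of
# every gradient, which is where the expected memory savings go
grad_fp32 = param.grad.to(torch.float32)
```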