Bagheera
try `grep oom /proc/vmstat`, which will show how many OOM-killer events have occurred without needing dmesg access
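for reference, a minimal sketch of that check — on kernels 4.13+ `/proc/vmstat` exposes an `oom_kill` counter that any unprivileged user can read:

```shell
# count OOM-killer invocations since boot, no root/dmesg needed
# (the oom_kill counter exists on kernels >= 4.13)
grep oom /proc/vmstat
```

a nonzero `oom_kill` value means the kernel has killed processes under memory pressure, which is usually the smoking gun for silently disappearing training jobs.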
system RAM yes
added a note about this to the flux quickstart
Resolves #656
i'm not sure i really have the bandwidth to look into this one; someone else with the equipment and ability to reproduce the issue might have to take a look...
no no, i really dislike conda :D
this was confirmed to be working on the main branch now. the train.sh script is updated to locate and rely on the nvidia libraries in the venv.
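a rough sketch of what that kind of discovery can look like (the variable names here are illustrative, not the actual train.sh contents) — find the pip-installed nvidia library dirs inside the venv and prepend them to the loader path:

```shell
# hypothetical sketch: locate the nvidia-* pip packages' lib dirs inside
# the active venv and put them on LD_LIBRARY_PATH so torch resolves its
# CUDA libraries from the venv rather than a system-wide install
NVIDIA_LIB_DIRS=$(find "${VIRTUAL_ENV:-.venv}" -type d -path '*nvidia*/lib' 2>/dev/null | tr '\n' ':')
export LD_LIBRARY_PATH="${NVIDIA_LIB_DIRS}${LD_LIBRARY_PATH}"
```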
actually, using `aot_eager` gets autograd involved, and then the dtype complaints happen. the gradients need to be in fp32 precision ... for a low-bit optim? 🤔
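a minimal repro-style sketch of what "gets autograd involved" means, assuming torch >= 2.0 — the `aot_eager` backend routes the model through AOTAutograd, so the backward pass is traced too, which is where mixed-dtype gradient checks start firing:

```python
import torch

# compiling with backend="aot_eager" traces both forward and backward
# through AOTAutograd; the backward here produces plain fp32 gradients
model = torch.nn.Linear(4, 4)
compiled = torch.compile(model, backend="aot_eager")

out = compiled(torch.randn(2, 4))
out.sum().backward()  # autograd is now in the loop
```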
yeah, simpletuner supports finetuning diffusion models via torch-mps, with or without optimum-quanto, up to the 12B-parameter Flux model, which really takes advantage of quantisation: down from 30G at pure...
either way, i'm not seeing memory savings with the 8-bit adamw, as i need the gradients to be upcast to fp32. the 4-bit optim uses some ops not implemented on MPS...
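a small sketch of why that upcast eats the savings — if the low-bit optimiser wants fp32 gradients, each bf16 grad gets a full-precision copy before the step, so the quantised optimiser state is offset by the grad upcast (names here are illustrative, not the actual optimiser internals):

```python
import torch

# bf16 parameter -> bf16 gradient from backward
param = torch.nn.Parameter(torch.randn(8, 8, dtype=torch.bfloat16))
loss = (param.float() ** 2).sum()
loss.backward()

# the upcast the low-bit optimiser demands: a transient fp32 copy of
# every gradient, which is where the expected memory savings go
grad_fp32 = param.grad.to(torch.float32)
```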