Support 8bit Optimizers on CPU
Feature request
Hi, thanks for the library! It would be great if the optimizers could also run on CPU. For example, I would like to use adamw_8bit to full-finetune an 8B model on a 24GB GPU (RTX 4090). With DeepSpeed offload the GPU memory is fine, but the CPU memory requirement is still very large, partly because the offloaded optimizer is standard AdamW, which needs 8B × 8 bytes = 64GB for the optimizer states alone.
This package provides the extremely helpful adamw_8bit, so I would appreciate it if it could be used in the setup above, hopefully reducing the optimizer state from 64GB to 8B × 2 bytes = 16GB.
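For reference, a back-of-the-envelope calculation of the figures above (8B parameters, two optimizer state tensors per parameter as in AdamW; any per-block quantization metadata for the 8-bit variant is ignored here):

```python
# Rough optimizer-state memory for an 8B-parameter model.
n_params = 8e9

fp32_states_gb = n_params * 2 * 4 / 1e9  # standard AdamW: 2 x fp32 states -> 64.0 GB
int8_states_gb = n_params * 2 * 1 / 1e9  # adamw_8bit:     2 x int8 states -> 16.0 GB

print(fp32_states_gb, int8_states_gb)  # 64.0 16.0
```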
Motivation
(see above)
Your contribution
Yes, and I am willing to help with this.
See #1021. I proposed that this should be a step on the path toward cross-platform support (especially Apple Silicon, since CUDA and Apple Silicon can't run on the same hardware, which makes validation complicated).
How can I run 4-bit on CPU only?
This also appears to be needed for accelerate: as of PyTorch 2.9.1 (at the time of writing), operators such as bitsandbytes::optimizer_update_8bit_blockwise are not supported on either MPS or CPU (even with PYTORCH_ENABLE_MPS_FALLBACK=1 set).
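As a minimal repro sketch of what fails today on a CPU-only (or MPS) setup: the optimizer can be constructed, but the step calls into the 8-bit blockwise update op, which has no CPU/MPS kernel. The model size and the exact exception raised are illustrative, not taken from the issue:

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(64, 64)  # parameters stay on the CPU
opt = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4)

loss = model(torch.randn(2, 64)).sum()
loss.backward()

try:
    opt.step()  # expected to fail: no CPU/MPS kernel for the 8-bit blockwise update
except Exception as exc:
    print(f"{type(exc).__name__}: {exc}")
```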