bitsandbytes
8-bit optimizers don't work with FSDP
When I use 8-bit Adam with FSDP, I get the following error:
RuntimeError: output tensor must have the same type as input tensor
If my understanding is correct, there seems to be a casting issue. Is there a workaround for this?
TIA.
I looked at the DeepSpeed implementation before, which had a similar issue with shared weights. The problem was that the algorithm splits all tensors found in the optimizer state, which includes the quantization statistics, and this can lead to incorrect behavior. The workaround in DeepSpeed is to hide the quantization statistics by obscuring their type (putting the tensor into a list/tuple).
I am not sure if the error message that you provided is related to that or not.
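To make this concrete, here is a rough sketch of the type-obscuring trick. This is not DeepSpeed's or bitsandbytes' actual code; the state keys ("absmax1", "qmap1", etc.) are only illustrative of what an 8-bit optimizer keeps per parameter:

```python
import torch

QUANT_KEYS = ("absmax1", "absmax2", "qmap1", "qmap2")  # illustrative names

def hide_quant_stats(optimizer):
    # Wrap quantization statistics in one-element lists so that generic
    # state-partitioning code, which only splits bare torch.Tensor objects,
    # skips them entirely.
    for state in optimizer.state.values():
        for key in QUANT_KEYS:
            if key in state and torch.is_tensor(state[key]):
                state[key] = [state[key]]  # obscure the type: list, not Tensor

def unhide_quant_stats(optimizer):
    # Undo the wrapping before the next optimizer.step().
    for state in optimizer.state.values():
        for key in QUANT_KEYS:
            if key in state and isinstance(state[key], list):
                state[key] = state[key][0]
```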
It would be nice if we could get 8-bit Adam working for FSDP. Would you be able to provide a simple example for debugging and replication purposes? Since I will be pretty busy the next month, I would also be very happy to guide you on how to fix this if you create a PR and provide me with error messages / stack traces. I think it would be pretty useful since more and more people are using FSDP.
Hey @TimDettmers,
I created a gist with an example. The gist includes a process_dataset.py script to prepare a dataset and a run_clm_bnb8.py script, which uses the adamw_bnb_8bit optimizer and FSDP.
https://gist.github.com/philschmid/99410e8bf66d34e52bb0cd5270b07989
I hope that's enough for you to test it.
I tested the example I shared with both the adamw_bnb_8bit and adafactor optimizers; it seems that adamw_bnb_8bit is not working correctly even though the training runs (compare the losses below):
AdamWInt8
{'loss': 2.6643, 'learning_rate': 4.847094801223242e-05, 'epoch': 0.09}
{'loss': 2.752, 'learning_rate': 4.694189602446483e-05, 'epoch': 0.18}
{'loss': 3.1493, 'learning_rate': 4.541284403669725e-05, 'epoch': 0.28}
{'loss': 3.412, 'learning_rate': 4.3883792048929664e-05, 'epoch': 0.37}
{'loss': 3.6722, 'learning_rate': 4.0825688073394495e-05, 'epoch': 0.55}
Adafactor
{'loss': 2.8385, 'learning_rate': 4.847094801223242e-05, 'epoch': 0.09}
{'loss': 2.6384, 'learning_rate': 4.694189602446483e-05, 'epoch': 0.18}
{'loss': 2.5725, 'learning_rate': 4.541284403669725e-05, 'epoch': 0.28}
{'loss': 2.5757, 'learning_rate': 4.3883792048929664e-05, 'epoch': 0.37}
{'loss': 2.5297, 'learning_rate': 4.0825688073394495e-05, 'epoch': 0.55}
Hi @TimDettmers, in my latest test it turns out that saving the model is the source of this issue.
Specifically the error pops up when I run this: optim_state = FSDP.full_optim_state_dict(model, optimizer)
What this is supposed to do is assemble the entire optimizer state based on the model params. What I think is the problem is that the optimizer state is in 8-bit but the model is not. The reason for my assumption is that the error is thrown by:
File "/share03/draj/environments/.conda/envs/yanmtt/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2136, in _all_gather_base work = group._allgather_base(output_tensor, input_tensor)
Indeed, if you look here: https://github.com/pytorch/pytorch/blob/55daa835e97a6e742cba1f0e9d2a5c78b1615e99/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L2779
there is a constraint that the dtypes of the tensors must be the same, and we are not able to guarantee this for a sharded 8-bit optimizer.
If we can find some way to bypass this requirement, then we are good to go.
How do we overcome this issue?
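For reference, a minimal sketch of the call pattern that triggers this, assuming a multi-GPU job launched with torchrun; the model and hyperparameters are placeholders:

```python
import torch
import torch.distributed as dist
import bitsandbytes as bnb
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = FSDP(torch.nn.Linear(1024, 1024).cuda())
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4)

# Run a few steps so the 8-bit state (uint8 buffers, absmax, qmap) actually exists.
for _ in range(3):
    loss = model(torch.randn(8, 1024, device="cuda")).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Gathering the full optimizer state all-gathers every tensor in optimizer.state,
# including the quantization statistics, whose dtypes/shapes do not match the
# fp32 flat parameters -> "output tensor must have the same type as input tensor".
optim_state = FSDP.full_optim_state_dict(model, optimizer)
```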
I have the same issue (#323). Is there any solution to this problem? @TimDettmers @prajdabre
There is another issue. When I applied FSDP CPU offload with Adam8bit, I got the following error:
Expected a cuda device, but got: cpu
Traceback (most recent call last):
File "scripts/sft/run_train.py", line 509, in <module>
main()
File "scripts/sft/run_train.py", line 503, in main
run(artifact_config, train_config, experiment_config, execution_config)
File "scripts/sft/run_train.py", line 378, in run
optimizer.step()
File "/home/kyeongpil/venv/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
return wrapped(*args, **kwargs)
File "/home/kyeongpil/venv/lib/python3.8/site-packages/accelerate/optimizer.py", line 140, in step
self.optimizer.step(closure)
File "/home/kyeongpil/venv/lib/python3.8/site-packages/torch/optim/optimizer.py", line 140, in wrapper
out = func(*args, **kwargs)
File "/home/kyeongpil/venv/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/kyeongpil/venv/lib/python3.8/site-packages/bitsandbytes/optim/optimizer.py", line 263, in step
self.update_step(group, p, gindex, pindex)
File "/home/kyeongpil/venv/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/kyeongpil/venv/lib/python3.8/site-packages/bitsandbytes/optim/optimizer.py", line 504, in update_step
F.optimizer_update_8bit_blockwise(
File "/home/kyeongpil/venv/lib/python3.8/site-packages/bitsandbytes/functional.py", line 972, in optimizer_update_8bit_blockwise
prev_device = pre_call(g.device)
File "/home/kyeongpil/venv/lib/python3.8/site-packages/bitsandbytes/functional.py", line 318, in pre_call
torch.cuda.set_device(device)
File "/home/kyeongpil/venv/lib/python3.8/site-packages/torch/cuda/__init__.py", line 324, in set_device
device = _get_device_index(device)
File "/home/kyeongpil/venv/lib/python3.8/site-packages/torch/cuda/_utils.py", line 30, in _get_device_index
raise ValueError('Expected a cuda device, but got: {}'.format(device))
ValueError: Expected a cuda device, but got: cpu
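For context, a minimal sketch of the kind of setup that produces this traceback, assuming torch.distributed is already initialized; the model and sizes are placeholders:

```python
import torch
import bitsandbytes as bnb
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, CPUOffload

model = FSDP(
    torch.nn.Linear(1024, 1024).cuda(),
    cpu_offload=CPUOffload(offload_params=True),  # params and grads live on CPU between steps
)
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4)

loss = model(torch.randn(8, 1024, device="cuda")).sum()
loss.backward()
# bitsandbytes calls torch.cuda.set_device(g.device) with a CPU device -> ValueError
optimizer.step()
```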
I'm not 100% sure, but this might be taken care of in PyTorch 2.0.
I encountered a similar issue using PEFT LoRA, load_in_8bit, and DeepSpeed ZeRO-3 (optimizer and parameter offload) with the Hugging Face Accelerate library. On a single GPU, training worked fine as expected.
If anyone has found a workaround to enable parallel training with PEFT LoRA and load_in_8bit, please let me know.
It seems that PyTorch 2 does not support 8-bit.
Is anyone still working on this?
On the error @prajdabre was mentioning, I find that the problem does not come from a dtype mismatch but rather a size mismatch. With printf debugging, I noticed that this seemed to first error on the absmax1 value, with:
output_tensor.shape == Size([361496576]), output_tensor.dtype == float32
input_tensor.shape == Size([22064]), input_tensor.dtype == float32
cc @awgu
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Noting that this issue, although stale, remains an issue. Although optimization can run, a functional state dict cannot be saved with 8-bit Adam.
I notice that there is a PR for FSDP functionality in https://github.com/TimDettmers/bitsandbytes/pull/840. However, it does not address the state dict issue in its tests.
@Titus-von-Koeller @TimDettmers sorry to hijack this issue; I'm doing something related but not exactly the same.
I'm trying to use FSDP with bitsandbytes==0.42.0 to finetune EleutherAI/pythia-1b with 8-bit weights:
- the model is loaded with AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1b", load_in_8bit=True)
- I added LoRA adapters, and I have different FSDP wrappers for anything that is not bnb.Linear8bitLt:
```
GPTNeoXLayer(
  (input_layernorm): FullyShardedDataParallel(
    (_fsdp_wrapped_module): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
  )
  (post_attention_layernorm): FullyShardedDataParallel(
    (_fsdp_wrapped_module): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
  )
  (post_attention_dropout): Dropout(p=0.0, inplace=False)
  (post_mlp_dropout): Dropout(p=0.0, inplace=False)
  (attention): GPTNeoXAttention(
    (rotary_emb): FullyShardedDataParallel(
      (_fsdp_wrapped_module): GPTNeoXRotaryEmbedding()
    )
    (query_key_value): lora.Linear8bitLt(
      (base_layer): Linear8bitLt(in_features=2048, out_features=6144, bias=True)
      (lora_dropout): ModuleDict(
        (default): Dropout(p=0.1, inplace=False)
      )
      (lora_A): ModuleDict(
        (default): FullyShardedDataParallel(
          (_fsdp_wrapped_module): Linear(in_features=2048, out_features=8, bias=False)
        )
      )
      (lora_B): ModuleDict(
        (default): FullyShardedDataParallel(
          (_fsdp_wrapped_module): Linear(in_features=8, out_features=6144, bias=False)
        )
      )
      (lora_embedding_A): ParameterDict()
      (lora_embedding_B): ParameterDict()
    )
    (dense): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
    (attention_dropout): Dropout(p=0.0, inplace=False)
  )
  (mlp): GPTNeoXMLP(
    (dense_h_to_4h): Linear8bitLt(in_features=2048, out_features=8192, bias=True)
    (dense_4h_to_h): Linear8bitLt(in_features=8192, out_features=2048, bias=True)
    (act): GELUActivation()
  )
)
```
The FSDP wrapping will fail at _validate_tensors_to_flatten when it tries to flatten Linear8bitLt for sharding. This is because Linear8bitLt.dtype is torch.int8, and _validate_tensors_to_flatten requires that it be a floating-point type.
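A quick illustration of the underlying dtype issue (a sketch, not FSDP's internal check): after quantization, the Linear8bitLt weight is stored as int8, which is exactly what the flat-parameter validation rejects:

```python
import torch
import bitsandbytes as bnb

layer = bnb.nn.Linear8bitLt(2048, 6144, has_fp16_weights=False)
layer = layer.cuda()  # weights are quantized to int8 when moved to the GPU

print(layer.weight.dtype)                     # torch.int8
print(torch.is_floating_point(layer.weight))  # False -> FSDP refuses to flatten it
```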
@152334H when you were trying this, did you load the model in 4/8-bit precision? Or is the model in 32-bit precision and you want to activate adamw_bnb_8bit?
I do not test via Hugging Face. I was in fact only trying to use an 8-bit optimizer with 32-bit weights, so I do not experience the int8 flat-parameter issue you do.
Hey @152334H @fabianlim @HamidShojanazeri @prajdabre @Kyeongpil @hscspring @dotsnangles @philschmid,
Could some of you please retest this and let us know whether the particular problems you were observing persist in the same form? If they differ, please provide detailed logs and a description.
We just released official FSDP support in the latest BNB version. However, this release was not yet focused on 8-bit optimizer support.
Be sure to install with:
pip install "bitsandbytes>=0.43.0"
@Titus-von-Koeller @TimDettmers I think the problem still remains even with BNB 0.43. The reason is that BNB performs the optimizer step with CUDA:
- When using CPU offload, the gradients are put onto the CPU.
- However, before the BNB 8-bit optimizer step, there is a `pre_call` to put all of the tensors onto the same GPU: `prev_device = pre_call(g.device)`
- Since the gradient `g` is on the CPU, it is obvious why `pre_call` will fail, since now `device="cpu"` below:

```python
def pre_call(device):
    prev_device = torch.cuda.current_device()
    torch.cuda.set_device(device)
    return prev_device
```

- And finally, all of the optimizer quantities in the `is_on_gpu` call are on the CPU: `is_on_gpu([g, p, state1, state2, qmap1, qmap2, absmax1, absmax2])`
Thus, while one could move all of the above quantities to the GPU, compute, and move them back to the CPU, I'm not sure this is the most optimal way to do things, as it would involve a lot of I/O overhead.
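To make the "move to GPU -> compute -> move back" idea concrete, here is a rough sketch. This is a hypothetical helper, not something bitsandbytes or FSDP provides; it ignores FSDP's flat-parameter bookkeeping and pays a full host-device round trip on every step:

```python
import torch

def gpu_roundtrip_step(optimizer, device="cuda"):
    """Illustrative only: temporarily move CPU-offloaded params, grads and
    optimizer state to the GPU, run the 8-bit step there, then move back."""
    moved = []
    for group in optimizer.param_groups:
        for p in group["params"]:
            if p.grad is None or p.device.type != "cpu":
                continue
            p.data = p.data.to(device)
            p.grad = p.grad.to(device)
            for k, v in optimizer.state.get(p, {}).items():
                if torch.is_tensor(v):
                    optimizer.state[p][k] = v.to(device)
            moved.append(p)

    optimizer.step()  # every tensor the 8-bit kernel touches is now on the GPU

    for p in moved:
        p.data = p.data.to("cpu")
        p.grad = p.grad.to("cpu")
        for k, v in optimizer.state[p].items():
            if torch.is_tensor(v):
                optimizer.state[p][k] = v.to("cpu")
```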
@fabianlim Yes, you're right! Thanks for the detailed analysis, this really helps make things actionable.
I'll put it on my list of things to look into, but I can't promise a timeline. We have a lot on our plate in the immediate future, as there are a lot of necessary changes that need to be prioritized to make BNB more maintainable and easier to contribute to.
In case you're interested in working with us on finding a solution, we would be very happy to collaborate and support you in any way!
@Titus-von-Koeller On one hand, we can work around this by loading all the quantities onto the GPU, but this will be very inefficient. On the other hand, I feel the better approach would be to run the optimizer step alongside the FSDP sharding.
As we see here, the optimizer step can be run after the FSDP post-grad hook. There is a comment there saying that for CPU offload the parameters and gradients are kept on the CPU, but this should not have to be the case. If, during offload, we can run the optimizer step on the GPU before the tensors get offloaded, then this solves our problem and we do not need to shuffle params around.
I have posted a comment on the PyTorch side asking when FSDP will start to support running optim.step on the GPU. I will keep you updated when I get a response.