
8-bit optimizers don't work with FSDP

Open prajdabre opened this issue 2 years ago • 24 comments

When I use 8-bit Adam with FSDP, I get the following error:

RuntimeError: output tensor must have the same type as input tensor

If my understanding is correct, there seems to be a casting issue. Is there any workaround for this?

TIA.

prajdabre avatar Nov 05 '22 06:11 prajdabre

I looked at the DeepSpeed implementation before, which had a similar issue with shared weights. The problem was that the algorithm splits all tensors found in the optimizer state, including the quantization statistics, which can lead to incorrect behavior. The workaround in DeepSpeed is to hide the quantization statistics by obscuring their type (putting the tensor into a list/tuple).
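As a rough illustration of that trick (the key names here are made up, not the actual DeepSpeed/bitsandbytes state layout): code that partitions every torch.Tensor it finds in the optimizer state will simply skip a tensor that is tucked inside a plain Python list.

```python
# Rough illustration of the "hide the statistics" trick (key names are made up,
# not the actual DeepSpeed/bitsandbytes state layout).
import torch

state = {
    "state1": torch.empty(1024, dtype=torch.uint8),  # quantized optimizer state
    "absmax1": torch.empty(8, dtype=torch.float32),  # quantization statistics
}

# hide the statistics before the sharding pass ...
state["absmax1"] = [state["absmax1"]]  # a list is not a torch.Tensor

# sharding code that only partitions bare tensors will now leave absmax1 alone
sharded = {k: v for k, v in state.items() if torch.is_tensor(v)}  # only state1

# ... and unwrap them again before the optimizer update needs them
state["absmax1"] = state["absmax1"][0]
```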

I am not sure if the error message that you provided is related to that or not.

It would be nice if we could get 8-bit Adam working for FSDP. Would you be able to provide a simple example for debugging and replication purposes? Since I will be pretty busy the next month, I would also be very happy to guide you on how to fix this if you create a PR and provide me with error messages / stack traces. I think it would be pretty useful since more and more people are using FSDP.

TimDettmers avatar Jan 03 '23 15:01 TimDettmers

Hey @TimDettmers,

I created a gist with an example. The gist includes a process_dataset.py to prepare a dataset and a run_clm_bnb8.py script, which uses the adamw_bnb_8bit optimizer and FSDP.

https://gist.github.com/philschmid/99410e8bf66d34e52bb0cd5270b07989

I hope that's enough for you to test it.

philschmid avatar Mar 24 '23 21:03 philschmid

I tested the example I shared with both adamw_bnb_8bit and Adafactor. Even though training runs, adamw_bnb_8bit does not seem to be working: its loss diverges, while Adafactor's loss decreases as expected.

AdamWInt8

{'loss': 2.6643, 'learning_rate': 4.847094801223242e-05, 'epoch': 0.09}
{'loss': 2.752, 'learning_rate': 4.694189602446483e-05, 'epoch': 0.18}
{'loss': 3.1493, 'learning_rate': 4.541284403669725e-05, 'epoch': 0.28}
{'loss': 3.412, 'learning_rate': 4.3883792048929664e-05, 'epoch': 0.37}
{'loss': 3.6722, 'learning_rate': 4.0825688073394495e-05, 'epoch': 0.55}

Adafactor

{'loss': 2.8385, 'learning_rate': 4.847094801223242e-05, 'epoch': 0.09}   
{'loss': 2.6384, 'learning_rate': 4.694189602446483e-05, 'epoch': 0.18}                   
{'loss': 2.5725, 'learning_rate': 4.541284403669725e-05, 'epoch': 0.28}
{'loss': 2.5757, 'learning_rate': 4.3883792048929664e-05, 'epoch': 0.37}
{'loss': 2.5297, 'learning_rate': 4.0825688073394495e-05, 'epoch': 0.55}                     

philschmid avatar Mar 28 '23 15:03 philschmid

Hi @TimDettmers, in my latest test it turns out that saving the model is the source of this issue.

Specifically the error pops up when I run this: optim_state = FSDP.full_optim_state_dict(model, optimizer)

What this is supposed to do is assemble the full optimizer state based on the model parameters. What I think is the problem is that the optimizer state is in 8-bit but the model is not. The reason for my assumption is that the error is thrown by:

File "/share03/draj/environments/.conda/envs/yanmtt/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2136, in _all_gather_base work = group._allgather_base(output_tensor, input_tensor)

Indeed, if you look here: https://github.com/pytorch/pytorch/blob/55daa835e97a6e742cba1f0e9d2a5c78b1615e99/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L2779

you can see there is a constraint that the dtypes of the output and input tensors must be the same, and we cannot guarantee this for a sharded 8-bit optimizer.
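For replication, a minimal sketch of the path that hits this (assuming a multi-GPU machine launched with torchrun; the toy model and hyperparameters are illustrative, not my actual training setup):

```python
# Minimal sketch of the failing save path, for replication purposes only.
import torch
import torch.distributed as dist
import bitsandbytes as bnb
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

model = FSDP(torch.nn.Linear(1024, 1024).cuda())
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4)

# one dummy step so the 8-bit state (state1/state2, absmax, qmap) is materialized
model(torch.randn(8, 1024, device="cuda")).sum().backward()
optimizer.step()

# gathering the full optimizer state dict is where the all_gather complains,
# because the sharded uint8 state / fp32 statistics do not line up with the
# shapes and dtypes FSDP expects for the flat parameters
optim_state = FSDP.full_optim_state_dict(model, optimizer)
```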

If we can find some way to bypass this requirement, then we are good to go.

How do we overcome this issue?

prajdabre avatar Apr 03 '23 12:04 prajdabre

I have the same issue (#323). Is there any solution to this problem? @TimDettmers @prajdabre

Kyeongpil avatar Apr 19 '23 08:04 Kyeongpil

There is another issue. When I applied FSDP CPU offload with Adam8bit, I got the following error:

Expected a cuda device, but got: cpu
Traceback (most recent call last):
File "scripts/sft/run_train.py", line 509, in <module>
  main()
File "scripts/sft/run_train.py", line 503, in main
  run(artifact_config, train_config, experiment_config, execution_config)
File "scripts/sft/run_train.py", line 378, in run
  optimizer.step()
File "/home/kyeongpil/venv/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
  return wrapped(*args, **kwargs)
File "/home/kyeongpil/venv/lib/python3.8/site-packages/accelerate/optimizer.py", line 140, in step
  self.optimizer.step(closure)
File "/home/kyeongpil/venv/lib/python3.8/site-packages/torch/optim/optimizer.py", line 140, in wrapper
  out = func(*args, **kwargs)
File "/home/kyeongpil/venv/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
  return func(*args, **kwargs)
File "/home/kyeongpil/venv/lib/python3.8/site-packages/bitsandbytes/optim/optimizer.py", line 263, in step
  self.update_step(group, p, gindex, pindex)
File "/home/kyeongpil/venv/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
  return func(*args, **kwargs)
File "/home/kyeongpil/venv/lib/python3.8/site-packages/bitsandbytes/optim/optimizer.py", line 504, in update_step
  F.optimizer_update_8bit_blockwise(
File "/home/kyeongpil/venv/lib/python3.8/site-packages/bitsandbytes/functional.py", line 972, in optimizer_update_8bit_blockwise
  prev_device = pre_call(g.device)
File "/home/kyeongpil/venv/lib/python3.8/site-packages/bitsandbytes/functional.py", line 318, in pre_call
  torch.cuda.set_device(device)
File "/home/kyeongpil/venv/lib/python3.8/site-packages/torch/cuda/__init__.py", line 324, in set_device
  device = _get_device_index(device)
File "/home/kyeongpil/venv/lib/python3.8/site-packages/torch/cuda/_utils.py", line 30, in _get_device_index
  raise ValueError('Expected a cuda device, but got: {}'.format(device))
ValueError: Expected a cuda device, but got: cpu
Rank(3)

Kyeongpil avatar Apr 20 '23 08:04 Kyeongpil

I'm not 100% sure, but this might be taken care of in PyTorch 2.0.

prajdabre avatar Apr 28 '23 12:04 prajdabre

I encountered a similar issue using PEFT LoRA, load_in_8bit, and DeepSpeed ZeRO stage 3 (optimizer and parameter offload) with the Hugging Face Accelerate library. On a single GPU, training was fine, as expected.

If anyone found a workaround to enable parallel training with PEFT LoRA and load_in_8bit, please let me know.

dotsnangles avatar May 07 '23 05:05 dotsnangles

It seems that PyTorch 2 does not support 8-bit.

hscspring avatar May 16 '23 07:05 hscspring

anyone still working on this....?

On the error @prajdabre was mentioning, I find that the problem does not come from a dtype mismatch, but rather a size mismatch. With printf debugging, I noticed that this seemed to first error on the absmax1 value, with:

output_tensor.shape == Size([361496576]), output_tensor.dtype == float32
input_tensor.shape == Size([22064]), input_tensor.dtype == float32

152334H avatar Jul 09 '23 14:07 152334H

cc @awgu

HamidShojanazeri avatar Jul 18 '23 17:07 HamidShojanazeri

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

github-actions[bot] avatar Dec 20 '23 16:12 github-actions[bot]

Noting that this issue, although stale, remains an issue. Although optimization can run, a functional state dict cannot be saved with 8-bit Adam.

I notice that there is a PR for FSDP functionality in https://github.com/TimDettmers/bitsandbytes/pull/840. It generally does not address the state dict issue in its tests.

152334H avatar Dec 22 '23 09:12 152334H

@Titus-von-Koeller @TimDettmers sorry to hijack this issue. Doing something related but not exactly the same.

I'm trying to use FSDP with bitsandbytes==0.42.0 to fine-tune EleutherAI/pythia-1b with 8-bit weights:

  • The model is loaded with AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1b", load_in_8bit=True).
  • I added LoRA adapters, and I have different FSDP wrappers for anything that is not bnb.Linear8bitLt:
```
GPTNeoXLayer(
  (input_layernorm): FullyShardedDataParallel(
    (_fsdp_wrapped_module): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
  )
  (post_attention_layernorm): FullyShardedDataParallel(
    (_fsdp_wrapped_module): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
  )
  (post_attention_dropout): Dropout(p=0.0, inplace=False)
  (post_mlp_dropout): Dropout(p=0.0, inplace=False)
  (attention): GPTNeoXAttention(
    (rotary_emb): FullyShardedDataParallel(
      (_fsdp_wrapped_module): GPTNeoXRotaryEmbedding()
    )
    (query_key_value): lora.Linear8bitLt(
      (base_layer): Linear8bitLt(in_features=2048, out_features=6144, bias=True)
      (lora_dropout): ModuleDict(
        (default): Dropout(p=0.1, inplace=False)
      )
      (lora_A): ModuleDict(
        (default): FullyShardedDataParallel(
          (_fsdp_wrapped_module): Linear(in_features=2048, out_features=8, bias=False)
        )
      )
      (lora_B): ModuleDict(
        (default): FullyShardedDataParallel(
          (_fsdp_wrapped_module): Linear(in_features=8, out_features=6144, bias=False)
        )
      )
      (lora_embedding_A): ParameterDict()
      (lora_embedding_B): ParameterDict()
    )
    (dense): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
    (attention_dropout): Dropout(p=0.0, inplace=False)
  )
  (mlp): GPTNeoXMLP(
    (dense_h_to_4h): Linear8bitLt(in_features=2048, out_features=8192, bias=True)
    (dense_4h_to_h): Linear8bitLt(in_features=8192, out_features=2048, bias=True)
    (act): GELUActivation()
  )
)
```

The FSDP wrapping will fail at _validate_tensors_to_flatten when it tries to flatten Linear8bitLt for sharding. This is because Linear8bitLt.dtype is torch.int8, and _validate_tensors_to_flatten requires that it be a floating point type.
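For reference, the selective wrapping I'm describing looks roughly like the sketch below (the predicate and policy names are mine, purely illustrative). Note that the parameters of modules left unwrapped, including the int8 Linear8bitLt weights, still get flattened into the enclosing FSDP unit's flat parameter, which is exactly where _validate_tensors_to_flatten raises.

```python
# Rough sketch of the selective wrapping: give every module that owns its own
# floating-point parameters a separate FSDP unit, and leave bnb Linear8bitLt
# layers unwrapped. Their int8 weights then still land in the enclosing flat
# parameter, which is where the flattening check fails.
from functools import partial

import bitsandbytes as bnb
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import lambda_auto_wrap_policy


def _wrap_non_int8(module: nn.Module) -> bool:
    # wrap modules that directly own parameters, unless they are int8 layers
    owns_params = any(True for _ in module.parameters(recurse=False))
    return owns_params and not isinstance(module, bnb.nn.Linear8bitLt)


policy = partial(lambda_auto_wrap_policy, lambda_fn=_wrap_non_int8)
# model = FSDP(model, auto_wrap_policy=policy)
```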

fabianlim avatar Jan 10 '24 05:01 fabianlim

Noting that this issue, although stale, remains an issue. Although optimization can run, a functional state dict cannot be saved with 8-bit Adam.

@152334H When you were trying this, did you load the model in 4/8-bit precision? Or was the model in 32-bit precision, with only the adamw_bnb_8bit optimizer activated?

fabianlim avatar Jan 10 '24 05:01 fabianlim

? I do not test via huggingface.

I was in fact only trying to use an 8-bit optimiser with 32-bit weights, though, so I do not experience the int8 FlatParameter issue you do.

152334H avatar Jan 10 '24 06:01 152334H

Hey @152334H @fabianlim @HamidShojanazeri @prajdabre @Kyeongpil @hscspring @dotsnangles @philschmid,

Could some of you please retest this and let us know whether the particular problems you were observing persist in the same form? If they now look different, please post detailed logs and a description.

We just released official FSDP support in the latest BNB version. However, this release did not yet focus on 8-bit optimizer support.

Be sure to install with

pip install "bitsandbytes>=0.43.0"

Titus-von-Koeller avatar Mar 19 '24 11:03 Titus-von-Koeller

@Titus-von-Koeller @TimDettmers I think the problem still remains even with BNB 0.43. The reason is that BNB performs the optimizer step with CUDA:

  1. When using CPU offload, the gradients are placed on the CPU.
  2. However, before the BNB 8-bit optimizer step there is a pre_call to put all of the tensors onto the same GPU:

         prev_device = pre_call(g.device)

  3. Since the gradient g is on the CPU, it is obvious why pre_call will fail, since device="cpu" below:

         def pre_call(device):
             prev_device = torch.cuda.current_device()
             torch.cuda.set_device(device)
             return prev_device

  4. And finally, all of the optimizer quantities in the is_on_gpu call are on the CPU:

         is_on_gpu([g, p, state1, state2, qmap1, qmap2, absmax1, absmax2])

Thus, while one could move all of the above quantities to GPU -> compute -> back to CPU, I'm not sure this is the most optimal way to do things, as it will involve a lot of IO overhead (a rough sketch of that workaround is below).
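A rough, untested sketch of that naive workaround (the wrapper name and structure are mine, purely illustrative; it round-trips every param, grad, and state tensor over PCIe on each step):

```python
# Rough, untested sketch: temporarily move the CPU-offloaded params, grads, and
# 8-bit optimizer state to the GPU, run the step there, then move them back.
import torch


class GpuStepWrapper:
    def __init__(self, optimizer, device="cuda"):
        self.optimizer = optimizer
        self.device = device

    def _move_state(self, p, device):
        # state1/state2, absmax1/2, qmap1/2, ... all have to follow the param
        for k, v in self.optimizer.state.get(p, {}).items():
            if torch.is_tensor(v):
                self.optimizer.state[p][k] = v.to(device)

    @torch.no_grad()
    def step(self):
        moved = []
        for group in self.optimizer.param_groups:
            for p in group["params"]:
                if p.grad is None or p.device.type != "cpu":
                    continue
                moved.append((p, p.data, p.grad))
                p.data = p.data.to(self.device)
                p.grad = p.grad.to(self.device)
                self._move_state(p, self.device)
        self.optimizer.step()
        for p, cpu_data, cpu_grad in moved:
            cpu_data.copy_(p.data)  # copy updated weights back into the CPU tensor
            p.data, p.grad = cpu_data, cpu_grad
            self._move_state(p, "cpu")
```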

fabianlim avatar Mar 20 '24 02:03 fabianlim

@fabianlim Yes, you're right! Thanks for the detailed analysis, this really helps make things actionable.

I'll put it on my list of things to look into, but can't promise a timeline. We have a lot on our plate in the immediate future, as there are a lot of necessary changes that need to be prioritized to make BNB more maintainable and easier to contribute to.

In case you're interested in working with us on finding a solution, we would be super happy to collaborate and support you in any way!

Titus-von-Koeller avatar Mar 20 '24 18:03 Titus-von-Koeller

@Titus-von-Koeller On one hand, we can work around this by loading all the quantities onto the GPU, but this will be very inefficient. On the other hand, I feel the better approach would be to run the optimizer step alongside the FSDP sharding.

As we see here, the optimizer step can be run after the FSDP post-backward (post-grad) hook. There is a comment there saying that for CPU offload the parameters and gradients are kept on the CPU, but this should not have to be the case. If, during offload, we can run the optimizer step on the GPU before the parameters get offloaded, then this solves our problem and we do not need to shuffle params around.
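To illustrate the general "step on the GPU before offload" idea, here is a toy sketch (no FSDP involved) that runs the 8-bit optimizer step per parameter inside the backward pass via the public register_post_accumulate_grad_hook API (PyTorch >= 2.1). How this composes with FSDP's own offload hooks is exactly the open question.

```python
# Toy sketch: fuse the 8-bit optimizer step into the backward pass so each
# parameter is updated while its gradient is still resident on the GPU.
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()

# one single-parameter 8-bit optimizer per param, stepped inside backward
optimizers = {p: bnb.optim.Adam8bit([p], lr=1e-4) for p in model.parameters()}


def _step_in_backward(param: torch.Tensor) -> None:
    optimizers[param].step()
    optimizers[param].zero_grad()


for p in model.parameters():
    p.register_post_accumulate_grad_hook(_step_in_backward)

loss = model(torch.randn(8, 1024, device="cuda")).sum()
loss.backward()  # parameters are updated during backward; no separate step()
```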

I have posted a comment on pytorch asking when FSDP will start to support running optim.step on the GPU. I will keep you updated when I get a response.

fabianlim avatar Mar 21 '24 01:03 fabianlim