
A100 80G fine-tune of llama-65b-hf got CUDA out of memory

Open elven2016 opened this issue 1 year ago • 12 comments

When I start training it works fine. The parameters are as follows: [screenshot: training parameters]

The GPU usage is as follows: [screenshot: GPU usage]

But when it reaches 17% it exits. The log is here:

[screenshot: error log]

elven2016 avatar Apr 24 '23 09:04 elven2016

What's your bitsandbytes version?

lywinged avatar Apr 24 '23 11:04 lywinged

What's your bitsandbytes version?

Name: bitsandbytes, Version: 0.38.1

elven2016 avatar Apr 24 '23 13:04 elven2016

What's your bitsandbytes version?

Name: bitsandbytes, Version: 0.38.1

Uninstall it, then try bitsandbytes==0.37.2?
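After reinstalling, it may also be worth confirming which build the training process actually imports; a minimal check, assuming a standard pip environment:

```python
# Minimal sanity check of the environment the trainer runs in.
from importlib.metadata import version

import torch

print("bitsandbytes:", version("bitsandbytes"))  # expect 0.37.2 after the downgrade
print("torch:", torch.__version__, "| GPUs visible:", torch.cuda.device_count())
```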

lywinged avatar Apr 24 '23 13:04 lywinged

What's your bitsandbytes version?

Name: bitsandbytes, Version: 0.38.1

Uninstall it, then try bitsandbytes==0.37.2?

I have run the fine-tune again with the default parameters. I will see whether it fails; if it does, I will try your solution. Thanks.

elven2016 avatar Apr 24 '23 13:04 elven2016

What's your bitsandbytes version?

Name: bitsandbytes, Version: 0.38.1

Uninstall it, then try bitsandbytes==0.37.2?

I have run the fine-tune again with the default parameters. I will see whether it fails; if it does, I will try your solution. Thanks.

It's a pity, but it hit CUDA out of memory again: [screenshots: error log]

elven2016 avatar Apr 25 '23 00:04 elven2016

bitsandbytes==0.37.2 failed? Then use 2 x 80G.

lywinged avatar Apr 25 '23 03:04 lywinged

bitsandbytes==0.37.2 failed? Then use 2 x 80G.

I only have one A100 80G card.

elven2016 avatar Apr 25 '23 04:04 elven2016

I'm having the same issue with the 7B, also on an A100 80GB (or whichever GPU). The stack trace is below:

Loading cached split indices for dataset at /root/.cache/huggingface/datasets/victor123___json/victor123--evol_instruct_70k-de37dd5750ecc166/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-f60fe2ec6b4cb6b4.arrow and /root/.cache/huggingface/datasets/victor123___json/victor123--evol_instruct_70k-de37dd5750ecc166/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-b5791538b539bc0d.arrow
  0%|                                                                    | 0/3186 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/root/alpaca-lora/finetune.py", line 283, in <module>
    fire.Fire(train)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/root/alpaca-lora/finetune.py", line 273, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1929, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2709, in training_step
    self.scaler.scale(loss).backward()
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py", line 141, in backward
    outputs = ctx.run_function(*detached_inputs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 565, in custom_forward
    return module(*inputs, output_attentions, None)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 305, in forward
    hidden_states = self.mlp(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 157, in forward
    return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/nn/modules.py", line 320, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 500, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 417, in forward
    output += torch.matmul(subA, state.subB)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 79.21 GiB total capacity; 76.04 GiB already allocated; 101.56 MiB free; 77.92 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  0%|                                                                    | 0/3186 [00:11<?, ?it/s]

I've tried both bitsandbytes 0.37.2 and the latest.
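For reference, the allocator hint at the end of the error (max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF) only takes effect if it is set before the first CUDA allocation. A minimal sketch; the 128 MiB value is just an illustrative guess, not something from finetune.py:

```python
import os

# Must be set before the CUDA caching allocator is initialised,
# i.e. before the first tensor is placed on the GPU.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # illustrative value

import torch

x = torch.zeros(1, device="cuda")  # first allocation picks up the setting
print(torch.cuda.memory_summary(abbreviated=True))
```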

getorca avatar Apr 25 '23 19:04 getorca

Upgrading transformers to the dev version (pip install git+https://github.com/huggingface/transformers) seems to have resolved it for me.

getorca avatar Apr 25 '23 20:04 getorca

Upgrading transformers to the dev version (pip install git+https://github.com/huggingface/transformers) seems to have resolved it for me.

Thanks for this.

bitsandbytes==0.37.2 failed? Then use 2 x 80G.

Thanks, bitsandbytes==0.37.2 may be OK. It now gets past 17% (currently “75%|███████▌ | 875/1164 [23:44:24<7:44:41”) and has not failed.

elven2016 avatar Apr 26 '23 01:04 elven2016

Upgrading transformers to the dev version (pip install git+https://github.com/huggingface/transformers) seems to have resolved it.

bitsandbytes==0.37.2 also works.

elven2016 avatar Apr 26 '23 01:04 elven2016

I tried both suggestions (bnb 0.37.2 and the latest git transformers) but still ran into the issue.

teknium1 avatar Apr 30 '23 10:04 teknium1

Hi, @elven2016. Have you encountered a phenomenon where the loss value turns upward after one epoch (out of 3 epochs total)? Like this: [screenshot: loss curve]

PeiqinSun avatar May 17 '23 05:05 PeiqinSun

transformers==4.29.2, bitsandbytes==0.37.2, peft==0.3.0

Fine-tuning llama-65b-hf with multiple GPUs still got CUDA out of memory.
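One thing worth checking in the multi-GPU case is whether the 8-bit weights are actually sharded across the cards rather than replicated per process. A rough sketch of sharded loading with accelerate-style device placement; the checkpoint path and per-GPU memory caps are placeholders, not values from finetune.py:

```python
import torch
from transformers import AutoModelForCausalLM

# Shard the 8-bit weights across both GPUs instead of loading a full copy on each.
model = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-65b-hf",               # placeholder: your converted 65B checkpoint
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",                    # let accelerate place layers across GPUs
    max_memory={0: "75GiB", 1: "75GiB"},  # placeholder caps to leave some headroom
)
print(set(model.hf_device_map.values()))  # should list more than one device
```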

ricksun2023 avatar May 31 '23 01:05 ricksun2023

Hi, the mid-way stop could be due to the sequence lengths in the data. It worked for the first 16% because the sequence lengths were below the maximum, but maybe at 17% there is a sequence at the maximum length that pushes memory above the limit. It happened to me in the past; you can inspect the data to see whether this is true.
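A quick way to verify this is to tokenise the training file and look at the length distribution before training. A minimal sketch, where the tokenizer path and data file are placeholders for whatever you pass to finetune.py:

```python
from datasets import load_dataset
from transformers import LlamaTokenizer

# Placeholders: point these at your base model and training data.
tokenizer = LlamaTokenizer.from_pretrained("path/to/llama-65b-hf")
data = load_dataset("json", data_files="alpaca_data_cleaned.json")["train"]

def n_tokens(example):
    # Rough per-example token count over the prompt fields.
    text = example["instruction"] + example.get("input", "") + example["output"]
    return len(tokenizer(text).input_ids)

lengths = sorted(n_tokens(ex) for ex in data)
print("max:", lengths[-1],
      "p99:", lengths[int(0.99 * len(lengths))],
      "median:", lengths[len(lengths) // 2])
```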

timothylimyl avatar Jun 10 '23 15:06 timothylimyl

@PeiqinSun Hello, I have encountered the same issue. The loss drops sharply at each epoch boundary and then gradually increases. I have also observed this phenomenon in a paper. After some preliminary tests, I found that this stair-like descent does not significantly affect the model's performance. I also suspect it may be related to the data. Do you have any recent findings or progress? I am very curious about this phenomenon but don't have any debugging clues. Many thanks!

s1ghhh avatar Jul 18 '23 09:07 s1ghhh