
Full fine-tuning of the Baichuan2-13B-Chat model consistently fails with an error, while Baichuan-13B-Chat trains normally.

Open angenge opened this issue 1 year ago • 4 comments

The error message is as follows:

File "/workdir/script/Firefly/train.py", line 130, in main() File "/workdir/script/Firefly/train.py", line 118, in main train_result = trainer.train() File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train return inner_training_loop( File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop tr_loss_step = self.training_step(model, inputs) File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2665, in training_step self.accelerator.backward(loss) File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1847, in backward self.deepspeed_engine_wrapped.backward(loss, **kwargs) File "/usr/local/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 167, in backward self.engine.backward(loss, **kwargs) File "/usr/local/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, **kwargs) File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1861, in backward self.optimizer.backward(loss, retain_graph=retain_graph) File "/usr/local/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, **kwargs) File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1993, in backward self.loss_scaler.backward(loss.float(), retain_graph=retain_graph) File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward scaled_loss.backward(retain_graph=retain_graph) File "/usr/local/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward torch.autograd.backward( File "/usr/local/lib/python3.10/site-packages/torch/autograd/init.py", line 200, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.20 GiB (GPU 6; 39.39 GiB total capacity; 34.87 GiB already allocated; 1.04 GiB free; 36.78 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
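
For reference, the allocator hint at the end of the message is applied through the PYTORCH_CUDA_ALLOC_CONF environment variable, which has to be set before torch makes its first CUDA allocation. A minimal sketch (the 128 MiB split size is only an illustrative value, not a recommendation from this thread):

```python
# Minimal sketch: apply the allocator hint from the OOM message.
# The env var must be set before torch's first CUDA allocation,
# so it is exported before importing torch. 128 MiB is an example value only.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # noqa: E402  (imported after the env var so the setting takes effect)
```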

angenge avatar Sep 15 '23 02:09 angenge

Did you ever solve this? I ran into the same problem.

jijeng avatar Sep 20 '23 22:09 jijeng

I hit the same problem too; reducing the batch size doesn't help either.

dachengai avatar Sep 22 '23 01:09 dachengai

I have the same issue; has this been resolved?

ckleong17 avatar Sep 25 '23 06:09 ckleong17

If you are using FusedAdam as your optimizer, this error is probably caused by it. You can turn on optimizer offload so that the DeepSpeedCPUAdam optimizer is used instead; that should resolve the error.
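
For context, the offload is enabled in the DeepSpeed config: with ZeRO-3 and `offload_optimizer` pointing at the CPU, DeepSpeed builds a DeepSpeedCPUAdam for the AdamW optimizer type instead of FusedAdam. A minimal sketch of such a config, written out from Python (the file name and the "auto" placeholders are assumptions, not Firefly's shipped config):

```python
# Minimal sketch of a ZeRO-3 config with CPU optimizer offload.
# With offload_optimizer.device == "cpu", DeepSpeed instantiates
# DeepSpeedCPUAdam for the AdamW optimizer type instead of FusedAdam.
# File name and "auto" placeholders are assumptions, not Firefly's shipped config.
import json

ds_config = {
    "bf16": {"enabled": "auto"},
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto"},
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

# Write the config so it can be passed to training via the trainer's deepspeed argument.
with open("ds_zero3_offload.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

The trade-off is slower optimizer steps, since the optimizer states now live in host memory, but it frees a large chunk of GPU memory compared with keeping FusedAdam's states on the device.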

ckleong17 avatar Sep 26 '23 05:09 ckleong17