The error message is as follows:
File "/workdir/script/Firefly/train.py", line 130, in
main()
File "/workdir/script/Firefly/train.py", line 118, in main
train_result = trainer.train()
File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2665, in training_step
self.accelerator.backward(loss)
File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1847, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/usr/local/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
self.engine.backward(loss, **kwargs)
File "/usr/local/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1861, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/usr/local/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1993, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/usr/local/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/usr/local/lib/python3.10/site-packages/torch/autograd/init.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.20 GiB (GPU 6; 39.39 GiB total capacity; 34.87 GiB already allocated; 1.04 GiB free; 36.78 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I have the same issue. Has this been resolved?
If you are using FusedAdam as your optimizer, this error is probably caused by it: FusedAdam keeps all optimizer states in GPU memory. You can enable optimizer offload so that DeepSpeedCPUAdam is used instead, which holds the optimizer states in CPU memory; that should resolve this error.
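For reference, here is a minimal sketch of the relevant part of a ZeRO-3 DeepSpeed config with optimizer offload enabled. The field names follow the public DeepSpeed config schema; how the config is actually passed into Firefly's train.py is an assumption and may differ in your setup.

```python
# Hypothetical sketch: with ZeRO-3 optimizer offload enabled, DeepSpeed constructs a
# DeepSpeedCPUAdam optimizer (states kept in host RAM) instead of FusedAdam (states on GPU).
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",     # move optimizer states to CPU memory
            "pin_memory": True,  # use pinned host memory for faster transfers
        },
    },
    # "auto" values are resolved by the HF Trainer integration from TrainingArguments.
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
```

With the HF Trainer, this dict (or an equivalent JSON file path) can be passed via `TrainingArguments(deepspeed=ds_config, ...)`; the exact wiring depends on how train.py sets up its arguments.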