Out of Memory Error during execution of bash scripts/TrainStage1_7b.sh
While executing `bash scripts/TrainStage1_7b.sh`, I encountered an Out of Memory (OOM) error. The error message is as follows:

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 0 has a total capacity of 23.59 GiB of which 104.75 MiB is free. Including non-PyTorch memory, this process has 23.42 GiB memory in use. Of the allocated memory, 23.17 GiB is allocated by PyTorch, and 2.91 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large, try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.
```
The error seems to occur at the following point in the code:

```python
trainer = AlignLLMwithSDCLIPTrainer(model=model, tokenizer=llm_tokenizer, args=training_args, **data_module)
```
System Info:

- PyTorch version: 2.1.0
- Transformers version: 4.28.1
- GPUs: 2x NVIDIA RTX 3090
- RAM: 128 GB
- CUDA version: 11.8
Given these specs, I’m wondering if it’s feasible to train the model without encountering OOM errors, and if there are any suggestions for resolving the memory issues.
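Since the error message itself suggests experimenting with `max_split_size_mb`, one thing I am considering is setting `PYTORCH_CUDA_ALLOC_CONF` before torch is imported. A minimal sketch (the 128 MiB value is just a first guess, and I am assuming the top of `TrainStage1.py` runs before anything else imports torch):

```python
import os

# Assumption: this runs before the first `import torch` anywhere in the
# process, so the CUDA caching allocator picks the setting up.
# 128 MiB is an untested first guess for max_split_size_mb.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # noqa: E402
```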
- Attempt with DeepSpeed:
To mitigate the OOM issue, I tried using DeepSpeed, but ran into compatibility issues. I am using DeepSpeed version 0.15.3 (version 0.7.3 did not work because it still imports `torch._six`, which has been removed from recent PyTorch releases). This is the config file I used for DeepSpeed:
```json
{
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 2e-4
        }
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        }
    },
    "fp16": {
        "enabled": true
    }
}
```
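One thing I noticed while writing this up: as far as I can tell from the DeepSpeed documentation, `offload_param` only takes effect with ZeRO stage 3, so with `"stage": 2` only the optimizer offload should actually apply. If stage 2 plus optimizer offload turns out not to be enough, I may try a stage-3 variant where only the `zero_optimization` block changes (untested on my setup):

```json
"zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true }
}
```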
I also modified `TrainStage1.py` as follows to enable DeepSpeed:
```python
def train():
    global local_rank

    # Add DeepSpeed config file path
    deepspeed_config_path = "../deepspeed_config.json"

    parser = transformers.HfArgumentParser((ModelArguments, DataArguments, TrainingArguments))
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()

    # Add DeepSpeed config path to TrainingArguments
    training_args.deepspeed = deepspeed_config_path

    local_rank = training_args.local_rank
    ...
```
However, when I ran the modified code, I encountered the following error:

```
AttributeError: 'TrainingArguments' object has no attribute 'hf_deepspeed_config'
```
Even though I explicitly added the path to the DeepSpeed config file, this error persists. Could you provide any guidance on how to resolve the OOM issue using DeepSpeed, or suggest which version of DeepSpeed is compatible with Transformers 4.28.1? Any advice would be greatly appreciated.
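For what it is worth, from a quick read of the transformers source my understanding (not verified yet) is that `hf_deepspeed_config` is only created inside `TrainingArguments.__post_init__`, i.e. when the `deepspeed` field is already set at parse time, so assigning `training_args.deepspeed` after `parse_args_into_dataclasses()` may simply be too late. A minimal sketch of the variant I plan to try next, which passes the config path through the parser instead (using the same `ModelArguments`/`DataArguments`/`TrainingArguments` as in `TrainStage1.py`):

```python
import sys

import transformers


def train():
    global local_rank
    deepspeed_config_path = "../deepspeed_config.json"

    parser = transformers.HfArgumentParser((ModelArguments, DataArguments, TrainingArguments))

    # Inject --deepspeed before parsing so that TrainingArguments.__post_init__
    # sees it and can set up the DeepSpeed integration itself.
    if "--deepspeed" not in sys.argv:
        sys.argv += ["--deepspeed", deepspeed_config_path]

    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
    local_rank = training_args.local_rank
    ...
```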