
[BUG]: On an eight-card A100 node, running 'examples/language/llama2' with the 'gemini_auto' plugin results in an 'out of memory' error

chensimian opened this issue 1 year ago • 1 comment

🐛 Describe the bug

Here is my script. It runs with the hybrid_parallel plugin, but the other plugins all fail with the same "out of memory" error:

```shell
torchrun --standalone --nproc_per_node 8 finetune.py \
    --plugin "gemini_auto" \
    --dataset "self_instruct" \
    --model_path "Llama2-Chinese-7b-Chat" \
    --task_name "finetuning" \
    --batch_size 2 \
    --save_dir "output_test"
```

Environment

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 4; 79.21 GiB total capacity; 75.40 GiB already allocated; 1.74 GiB free; 76.12 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
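The traceback itself suggests trying `max_split_size_mb` to reduce allocator fragmentation. A minimal sketch of applying that hint before launching the run (the value 128 is illustrative, not from this issue):

```shell
# Cap the size of blocks the caching allocator will split,
# which can reduce fragmentation when reserved >> allocated.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
```

This only mitigates fragmentation; it will not help if the model genuinely exceeds GPU memory.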

chensimian avatar Nov 09 '23 03:11 chensimian

Hi, how about trying to set offload_optim_frac and offload_param_frac to 1.0?
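For reference, these fractions are constructor arguments of the Gemini plugin. A minimal sketch of building a booster with full CPU offload, assuming the ColossalAI `GeminiPlugin`/`Booster` API (the offload fractions take effect with the static placement policy; other arguments are illustrative):

```python
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin

# Offload all optimizer state and all parameters to CPU memory
# to relieve GPU memory pressure; fractions range from 0.0 to 1.0.
plugin = GeminiPlugin(
    placement_policy="static",
    offload_optim_frac=1.0,
    offload_param_frac=1.0,
)
booster = Booster(plugin=plugin)
```

Offloading to CPU trades GPU memory for host–device transfer time, so expect slower steps in exchange for fitting the model.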

flybird11111 avatar Dec 11 '23 06:12 flybird11111