ColossalAI
[BUG]: CUDA out of memory
🐛 Describe the bug
Tried to run train_sft.sh and got an OOM error: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 23.68 GiB total capacity; 18.08 GiB already allocated; 73.00 MiB free; 22.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
By the way, my GPUs: 2× RTX 3090.
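For reference, the allocator hint in the error message above can be tried by setting PYTORCH_CUDA_ALLOC_CONF before launching. This is only a minimal sketch: the 128 MiB split size is an arbitrary starting value, and it mitigates fragmentation rather than reducing total memory demand.

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128  # then launch with the same torchrun command shown below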
Here is the script:
torchrun --standalone --nproc_per_node=2 train_sft.py \
    --pretrain "/home/kidd/projects/llms/colossal-ai/ColossalAI/llama-7b/" \
    --model 'llama' \
    --strategy colossalai_gemini \
    --log_interval 10 \
    --save_path /home/kidd/projects/llms/colossal-ai/ColossalAI/coati-7b \
    --dataset /home/kidd/projects/llms/colossal-ai/ColossalAI/data_set/instinwild_ch.json \
    --batch_size 4 \
    --accimulation_steps 8 \
    --lr 2e-5 \
    --max_datasets_size 512 \
    --max_epochs 1
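If the run still does not fit on 24 GB cards, a common first adjustment is to shrink the per-GPU batch and compensate with more accumulation steps so the effective batch size stays the same. This is only a sketch reusing the flags from the script above (the accumulation flag is spelled as it appears there); the values 1 and 32 are illustrative, not tested.

torchrun --standalone --nproc_per_node=2 train_sft.py \
    --pretrain "/home/kidd/projects/llms/colossal-ai/ColossalAI/llama-7b/" \
    --model 'llama' \
    --strategy colossalai_gemini \
    --save_path /home/kidd/projects/llms/colossal-ai/ColossalAI/coati-7b \
    --dataset /home/kidd/projects/llms/colossal-ai/ColossalAI/data_set/instinwild_ch.json \
    --batch_size 1 \
    --accimulation_steps 32 \
    --lr 2e-5 \
    --max_datasets_size 512 \
    --max_epochs 1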
Environment
------------ Environment ------------
Colossal-AI version: 0.2.8
PyTorch version: 2.0.0
System CUDA version: 11.7
CUDA version required by PyTorch: 11.7
Note:
- The table above checks the versions of the libraries/tools in the current environment
- If the System CUDA version is N/A, you can set the CUDA_HOME environment variable to locate it
- If the CUDA version required by PyTorch is N/A, you probably did not install a CUDA-compatible PyTorch. This value is given by torch.version.cuda and you can go to https://pytorch.org/get-started/locally/ to download the correct version.
------------ CUDA Extensions AOT Compilation ------------
Found AOT CUDA Extension: x
PyTorch version used for AOT compilation: N/A
CUDA version used for AOT compilation: N/A
Note:
- AOT (ahead-of-time) compilation of the CUDA kernels occurs during installation when the environment variable CUDA_EXT=1 is set
- If AOT compilation is not enabled, stay calm as the CUDA kernels can still be built during runtime
------------ Compatibility ------------
PyTorch version match: N/A
System and PyTorch CUDA version match: ✓
System and Colossal-AI CUDA version match: N/A
@JThh Hello, I just tried placement=cpu, but I still get this error:
[04/10/23 15:35:51] INFO colossalai - colossalai - INFO: /root/anaconda3/envs/coati/lib/python3.8/site-packages/coati/dataset/sft_dataset.py:124 __init__
INFO colossalai - colossalai - INFO: Loading data...
INFO colossalai - colossalai - INFO: /root/anaconda3/envs/coati/lib/python3.8/site-packages/coati/dataset/sft_dataset.py:126 __init__
INFO colossalai - colossalai - INFO: Loaded 51504 examples.
INFO colossalai - colossalai - INFO: /root/anaconda3/envs/coati/lib/python3.8/site-packages/coati/dataset/sft_dataset.py:129 __init__
INFO colossalai - colossalai - INFO: Limiting dataset to 512 examples.
INFO colossalai - colossalai - INFO: /root/anaconda3/envs/coati/lib/python3.8/site-packages/coati/dataset/sft_dataset.py:132 __init__
INFO colossalai - colossalai - INFO: Formatting inputs...
INFO colossalai - colossalai - INFO: /root/anaconda3/envs/coati/lib/python3.8/site-packages/coati/dataset/sft_dataset.py:140 __init__
INFO colossalai - colossalai - INFO: Tokenizing inputs... This may take some time...
Traceback (most recent call last):
File "train_sft.py", line 184, in <module>
train(args)
File "train_sft.py", line 146, in train
trainer = SFTTrainer(model=model,
File "/root/anaconda3/envs/coati/lib/python3.8/site-packages/coati/trainer/sft.py", line 61, in __init__
self.optimizer = strategy.setup_optimizer(optim, self.model)
File "/root/anaconda3/envs/coati/lib/python3.8/site-packages/coati/trainer/strategies/colossalai.py", line 148, in setup_optimizer
return zero_optim_wrapper(model, optimizer, optim_config=self.zero_optim_config, **self.optim_kwargs)
File "/root/anaconda3/envs/coati/lib/python3.8/site-packages/colossalai/nn/parallel/zero_wrapper.py", line 105, in zero_optim_wrapper
return LowLevelZeroOptimizer(optimizer, **config_dict)
File "/root/anaconda3/envs/coati/lib/python3.8/site-packages/colossalai/zero/sharded_optim/low_level_optim.py", line 175, in __init__
fp32_flat_current_rank = fp16_flat_current_rank.float()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 25.10 GiB (GPU 0; 31.75 GiB total capacity; 12.58 GiB already allocated; 18.29 GiB free; 12.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3939) of binary: /root/anaconda3/envs/coati/bin/python
Traceback (most recent call last):
File "/root/anaconda3/envs/coati/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/root/anaconda3/envs/coati/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/root/anaconda3/envs/coati/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/root/anaconda3/envs/coati/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/root/anaconda3/envs/coati/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/envs/coati/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_sft.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-04-10_15:36:03
host : gpu19
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3939)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
My GPU is a V100 with 32 GB of memory.
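For context, LLaMA-7B has roughly 6.7 billion parameters, and 6.7e9 parameters × 4 bytes (fp32) ≈ 25 GiB, which matches the 25.10 GiB allocation in the traceback above. The failure appears to happen while the optimizer builds a full fp32 copy of the model weights on a single GPU, which cannot fit alongside the memory already allocated on a 32 GB card.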
I got the same issue while using A100s. After setting --nproc_per_node=8, the issue was fixed.
Hi @janglichao @610lyn, we have added a section on how to train with limited resources. Thanks. https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat#faq
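For readers who land here before reading the FAQ, here is a sketch of what a lower-memory launch might look like on 2 GPUs, plus the same --dataset and --save_path arguments as in the original script. The --lora_rank flag and the colossalai_zero2 strategy name are assumptions based on the options the FAQ discusses and may differ by version, so check them against train_sft.py's argument parser before use.

# Assumed options (verify against this version of train_sft.py):
#   --strategy colossalai_zero2   ZeRO-2 optimizer/gradient sharding
#   --lora_rank 8                 train low-rank adapters, which sharply shrinks optimizer state
torchrun --standalone --nproc_per_node=2 train_sft.py \
    --pretrain "/home/kidd/projects/llms/colossal-ai/ColossalAI/llama-7b/" \
    --model 'llama' \
    --strategy colossalai_zero2 \
    --lora_rank 8 \
    --batch_size 1 \
    --accimulation_steps 32 \
    --lr 2e-5 \
    --max_epochs 1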
Same error with 2× V100 GPUs. Have you solved this problem?