[BUG]: CUDA out of memory. Tried to allocate 25.10 GiB
🐛 Describe the bug
I get "CUDA out of memory. Tried to allocate 25.10 GiB" when running train_sft.sh. It needs about 25.1 GB, and my GPU is a V100 with 32 GB of memory, but I still get this error:
[04/10/23 15:34:46] INFO colossalai - colossalai - INFO: /root/anaconda3/envs/coati/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[04/10/23 15:34:49] INFO colossalai - colossalai - INFO: /root/anaconda3/envs/coati/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 42, python random: 42, ParallelMode.DATA: 42, ParallelMode.TENSOR: 42,the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /root/anaconda3/envs/coati/lib/python3.8/site-packages/colossalai/initialize.py:116 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
Loading checkpoint shards: 100%|██████████| 2/2 [00:07<00:00,  3.52s/it]
False
[extension] Compiling or loading the JIT-built cpu_adam kernel during runtime now
Emitting ninja build file /root/.cache/colossalai/torch_extensions/torch1.13_cu11.6/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
[extension] Time to compile or load cpu_adam op: 0.3194923400878906 seconds
False
[extension] Compiling or loading the JIT-built fused_optim kernel during runtime now
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/colossalai/torch_extensions/torch1.13_cu11.6/build.ninja...
Building extension module fused_optim...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_optim...
[extension] Time to compile or load fused_optim op: 0.29384517669677734 seconds
[04/10/23 15:35:51] INFO colossalai - colossalai - INFO: /root/anaconda3/envs/coati/lib/python3.8/site-packages/coati/dataset/sft_dataset.py:124 __init__
INFO colossalai - colossalai - INFO: Loading data...
INFO colossalai - colossalai - INFO: /root/anaconda3/envs/coati/lib/python3.8/site-packages/coati/dataset/sft_dataset.py:126 __init__
INFO colossalai - colossalai - INFO: Loaded 51504 examples.
INFO colossalai - colossalai - INFO: /root/anaconda3/envs/coati/lib/python3.8/site-packages/coati/dataset/sft_dataset.py:129 __init__
INFO colossalai - colossalai - INFO: Limiting dataset to 512 examples.
INFO colossalai - colossalai - INFO: /root/anaconda3/envs/coati/lib/python3.8/site-packages/coati/dataset/sft_dataset.py:132 __init__
INFO colossalai - colossalai - INFO: Formatting inputs...
INFO colossalai - colossalai - INFO: /root/anaconda3/envs/coati/lib/python3.8/site-packages/coati/dataset/sft_dataset.py:140 __init__
INFO colossalai - colossalai - INFO: Tokenizing inputs... This may take some time...
Traceback (most recent call last):
File "train_sft.py", line 184, in <module>
train(args)
File "train_sft.py", line 146, in train
trainer = SFTTrainer(model=model,
File "/root/anaconda3/envs/coati/lib/python3.8/site-packages/coati/trainer/sft.py", line 61, in __init__
self.optimizer = strategy.setup_optimizer(optim, self.model)
File "/root/anaconda3/envs/coati/lib/python3.8/site-packages/coati/trainer/strategies/colossalai.py", line 148, in setup_optimizer
return zero_optim_wrapper(model, optimizer, optim_config=self.zero_optim_config, **self.optim_kwargs)
File "/root/anaconda3/envs/coati/lib/python3.8/site-packages/colossalai/nn/parallel/zero_wrapper.py", line 105, in zero_optim_wrapper
return LowLevelZeroOptimizer(optimizer, **config_dict)
File "/root/anaconda3/envs/coati/lib/python3.8/site-packages/colossalai/zero/sharded_optim/low_level_optim.py", line 175, in __init__
fp32_flat_current_rank = fp16_flat_current_rank.float()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 25.10 GiB (GPU 0; 31.75 GiB total capacity; 12.58 GiB already allocated; 18.29 GiB free; 12.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3939) of binary: /root/anaconda3/envs/coati/bin/python
Traceback (most recent call last):
File "/root/anaconda3/envs/coati/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/root/anaconda3/envs/coati/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/root/anaconda3/envs/coati/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/root/anaconda3/envs/coati/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/root/anaconda3/envs/coati/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/envs/coati/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_sft.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-04-10_15:36:03
host : gpu19
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3939)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
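The error message itself suggests setting max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF. A minimal sketch of trying that is below (it has to take effect before the first CUDA allocation, e.g. exported before launching train_sft.sh), although the failed request (25.10 GiB) is larger than the reported free memory (18.29 GiB), so this would only help with fragmentation, not missing capacity:

```python
# Sketch: the allocator hint from the OOM message. The value 128 is just an example.
# This must be set before the first CUDA allocation in the process.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```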
I also tried placement=cpu, but I still get this error.
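For reference, a minimal sketch of the kind of change I mean, assuming the strategy is built in train_sft.py with Coati's ColossalAIStrategy (the exact class and argument names may differ between versions):

```python
# Sketch, not verbatim from train_sft.py: ZeRO stage-2 strategy with CPU placement,
# so optimizer states are kept in host memory instead of on the 32 GB V100.
# ColossalAIStrategy / placement_policy are assumed from the Coati version installed here.
from coati.trainer.strategies import ColossalAIStrategy

strategy = ColossalAIStrategy(stage=2, placement_policy='cpu')
```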
Environment
CUDA: 11.6, PyTorch: 1.13.1, GPU: 6 × V100 with 32 GB memory each
A single GPU card would be hard; could you please try increasing the number of GPUs for training?
Hi @Tian14267, we have added "How to train with limited resources" to the FAQ: https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat#faq. Thanks.