ColossalAI-Examples
ColossalAI cannot run the shufflenet_v2_x1_0 model as torch does
🐛 Describe the bug
models.shufflenet_v2_x1_0 can be trained with BATCH_SIZE = 16384 in plain PyTorch, but the same run fails with ColossalAI.
The log output is below:
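For context, the config used for this run matches the values ColossalAI prints in the "Your Config" block further down; a minimal config.py along those lines (the file name and exact layout are my reconstruction based on the resnet example's convention, not the actual file) would be:
# config.py -- reconstructed from the "Your Config" block printed in the log below;
# the file name and layout are assumptions following the resnet example's convention
from colossalai.amp import AMP_TYPE

BATCH_SIZE = 16384
NUM_EPOCHS = 200

CONFIG = dict(fp16=dict(mode=AMP_TYPE.TORCH))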
(conda-general) user@user:~/research/Experiments/ColossalAI-Examples/image/resnet$ colossalai run --nproc_per_node 1 train.py
[06/16/22 13:30:42] INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:1 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:2 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 1 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:3 to store for rank: 0
...
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:5 with 1 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:6 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:6 with 1 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:7 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:7 with 1 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:8 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:8 with 1 nodes.
INFO colossalai - colossalai - INFO: /home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[06/16/22 13:30:43] INFO colossalai - colossalai - INFO: /home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024, the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/colossalai/initialize.py:117 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
Files already downloaded and verified
[06/16/22 13:30:44] INFO colossalai - colossalai - INFO: /home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/colossalai/initialize.py:266 initialize
INFO colossalai - colossalai - INFO:
========== Your Config ========
{'BATCH_SIZE': 16384,
 'CONFIG': {'fp16': {'mode': <AMP_TYPE.TORCH: 'torch'>}},
 'NUM_EPOCHS': 200}
================================
INFO colossalai - colossalai - INFO: /home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/colossalai/initialize.py:278 initialize
INFO colossalai - colossalai - INFO: cuDNN benchmark = True, deterministic = False
WARNING colossalai - colossalai - WARNING: /home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/colossalai/initialize.py:304 initialize
WARNING colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
WARNING colossalai - colossalai - WARNING: /home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/colossalai/initialize.py:436 initialize
WARNING colossalai - colossalai - WARNING: No PyTorch DDP or gradient handler is set up, please make sure you do not need to all-reduce the gradients after a training step.
25%|██▌ | 1/4 [00:05<00:16, 5.59s/it]
Traceback (most recent call last):
File "/home/user/research/Experiments/ColossalAI-Examples/image/resnet/train.py", line 157, in <module>
main()
File "/home/user/research/Experiments/ColossalAI-Examples/image/resnet/train.py", line 103, in main
output = engine(img)
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/colossalai/engine/_base_engine.py", line 183, in __call__
return self.model(*args, **kwargs)
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torchvision/models/shufflenetv2.py", line 156, in forward
return self._forward_impl(x)
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torchvision/models/shufflenetv2.py", line 147, in _forward_impl
x = self.stage2(x)
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/nn/modules/container.py", line 141, in forward
input = module(input)
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torchvision/models/shufflenetv2.py", line 85, in forward
out = torch.cat((x1, self.branch2(x2)), dim=1)
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/nn/modules/container.py", line 141, in forward
input = module(input)
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
result = forward_call(*input, **kwargs)
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 447, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 443, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA out of memory. Tried to allocate 58.00 MiB (GPU 0; 10.76 GiB total capacity; 9.54 GiB already allocated; 9.00 MiB free; 9.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2549731) of binary: /home/user/software/python/anaconda/anaconda3/envs/conda-general/bin/python
Fatal Python error: Segmentation fault
Thread 0x00007ff209a3e700 (most recent call first):
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/threading.py", line 324 in wait
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/threading.py", line 600 in wait
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/utils.py", line 254 in _run
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/threading.py", line 946 in run
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/threading.py", line 1009 in _bootstrap_inner
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/threading.py", line 966 in _bootstrap
Current thread 0x00007ff2e1d5a740 (most recent call first):
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 31 in get_all
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 53 in synchronize
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 67 in barrier
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 906 in _exit_barrier
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 877 in _invoke_run
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 709 in run
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 125 in wrapper
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 236 in launch_agent
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 131 in __call__
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/run.py", line 715 in run
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/run.py", line 724 in main
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345 in wrapper
File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/bin/torchrun", line 33 in <module>
Extension modules: torch._C, torch._C._fft, torch._C._linalg, torch._C._nn, torch._C._sparse, torch._C._special, mkl._mklinit, mkl._py_mkl_service, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg.lapack_lite, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator (total: 22)
Error: failed to run torchrun --nproc_per_node=1 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train.py on 127.0.0.1
Environment
CUDA: 11.4
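Side note: independently of the ColossalAI question, the OOM message above suggests trying max_split_size_mb to reduce allocator fragmentation. One way to experiment with that (the 128 MiB value is only an illustrative guess, not a recommendation) is to set the option before CUDA is first touched:
# set the caching-allocator option before torch initializes CUDA;
# 128 MiB is an arbitrary example value, tune it for your GPU
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # all CUDA work must happen after the variable is set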
Hi, could you provide your training code so we can reproduce this bug? Also, could you double-check your dataset settings?
I have tried our code with a simple change of the model from resnet to shufflenet. It takes about 32521 MiB with BATCH_SIZE = 16384, and no OOM occurred.
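For reference, the change was essentially a one-line model swap in the example's train.py; roughly (the original example's exact model constructor may differ, and num_classes=10 assumes the CIFAR10 dataset used by the example):
import torchvision.models as models

# model = ...  # the resnet used by the original example
model = models.shufflenet_v2_x1_0(num_classes=10)  # swapped-in model for this test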
Hi @BoxiangW, here is the script, attached as train.py.
Hi @songyuc, you can uninstall your current colossalai and install our latest version with:
# uninstall the previously installed release first
pip uninstall -y colossalai
git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI
# install dependency
pip install -r requirements/requirements.txt
# install colossalai
pip install .
There was a bug in the previous release that took up extra GPU memory. With our latest version, BATCH_SIZE = 16384 only takes about 10605 MiB. Hope this solves your issue.
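After reinstalling, you can quickly confirm which build is active and what the run actually peaks at; torch.cuda.max_memory_allocated is plain PyTorch, and where you print it is up to you:
import colossalai
import torch

print(colossalai.__version__)  # verify the source build is the one being imported

# after a few training steps, report the peak allocation on the current device
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1024**2:.0f} MiB")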
Thank you for the guide! I will try it later.