ColossalAI-Examples
Vision Transformer CIFAR-10 bug
🐛 Describe the bug
When I run a ViT experiment with the following command:
node=76
prefix="srun --nodes=1 --gres=gpu:4 --cpus-per-task=4 --ntasks=1 -w SG-IDC1-10-51-2-$node"
$prefix colossalai run --nproc_per_node 4 train_with_cifar10.py --config configs/vit_1d_tp2_pp2.py --host=10.51.2.$node
I got:
tensor shape 128
Traceback (most recent call last):
  File "train_with_cifar10.py", line 122, in <module>
    main()
  File "train_with_cifar10.py", line 116, in main
    engine.execute_schedule(data_iter, return_output_label=False)
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 198, in execute_schedule
    output, label, loss = self._schedule.forward_backward_step(self, data_iter, **kwargs)
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/engine/schedule/_pipeline_schedule.py", line 303, in forward_backward_step
    input_tensor = comm.recv_forward(ft_shape,
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/communication/p2p.py", line 194, in recv_forward
    input_tensor, _ = _communicate(recv_prev=True,
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/communication/p2p.py", line 119, in _communicate
    tensor_recv_prev, recv_prev_split = create_recv_buffer_with_shapes(recv_prev_shape, dtype,
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/communication/p2p.py", line 49, in create_recv_buffer_with_shapes
    recv_chunk_shape, recv_split = _get_tensor_shape(recv_shape, scatter_gather_tensors)
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/communication/p2p.py", line 30, in _get_tensor_shape
    tensor_chunk_shape = reduce(operator.mul, tensor_shape, 1)
TypeError: reduce() arg 2 must support iteration
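For reference, functools.reduce raises exactly this TypeError when its second argument is not iterable, which suggests the expected tensor shape reached _get_tensor_shape as a bare int (the "tensor shape 128" printed above) rather than a sequence such as torch.Size([128]). A minimal sketch of that failure mode, independent of the ColossalAI internals:

```python
# Minimal sketch of the failure mode, not ColossalAI code: functools.reduce
# needs an iterable as its second argument, so a scalar shape reproduces the
# same TypeError as in the traceback above.
import operator
from functools import reduce

proper_shape = (128,)   # a sequence of dims, e.g. torch.Size([128]), works
scalar_shape = 128      # a bare int, as hinted by the "tensor shape 128" print

print(reduce(operator.mul, proper_shape, 1))   # -> 128
# reduce(operator.mul, scalar_shape, 1)        # -> TypeError: reduce() arg 2 must support iteration
```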
Environment
I installed ColossalAI via
pip install colossalai==0.1.6+torch1.10cu10.2 -f https://release.colossalai.org
Other environment information was collected via this:
PyTorch version: 1.11.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 5.3.0
Clang version: Could not collect
CMake version: version 3.19.3
Libc version: glibc-2.17
Python version: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-3.10.0-693.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: Tesla V100-PCIE-32GB
GPU 1: Tesla V100-PCIE-32GB
GPU 2: Tesla V100-PCIE-32GB
GPU 3: Tesla V100-PCIE-32GB
GPU 4: Tesla V100-PCIE-32GB
GPU 5: Tesla V100-PCIE-32GB
GPU 6: Tesla V100-PCIE-32GB
GPU 7: Tesla V100-PCIE-32GB
Nvidia driver version: 470.63.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] colossalai==0.1.6+torch1.10cu10.2
[pip3] numpy==1.22.4
[pip3] torch==1.11.0
[pip3] torchvision==0.12.0
[conda] colossalai 0.1.6+torch1.10cu10.2 pypi_0 pypi
[conda] numpy 1.22.4 pypi_0 pypi
[conda] torch 1.11.0 pypi_0 pypi
[conda] torchvision 0.12.0 pypi_0 pypi
I got the same problem. And if I change the config file to vit_pipeline.py, the error becomes:
TypeError: layer_norm(): argument 'input' (position 1) must be Tensor, not list
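That layer_norm message is the generic PyTorch error for passing a list where a Tensor is expected, so the pipeline presumably hands the norm layer a list of stage outputs rather than a single tensor. A minimal sketch of the same error, independent of the ColossalAI code path:

```python
# Minimal sketch of the error, not the actual ColossalAI/Titans code path:
# nn.LayerNorm expects a Tensor, so wrapping the input in a list reproduces
# the same message.
import torch
import torch.nn as nn

norm = nn.LayerNorm(128)
x = torch.randn(4, 128)

norm(x)      # works
# norm([x])  # TypeError: layer_norm(): argument 'input' (position 1) must be Tensor, not list
```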
This PR resolved the related bugs: https://github.com/hpcaitech/ColossalAI/pull/1100. You can try again with the latest main branch code.
Thanks, Liu. I pulled the latest code of ColossalAI and ColossalAI-Examples, then I got another error about titans:
Traceback (most recent call last):
  File "train_with_cifar10.py", line 13, in <module>
    from titans.model.vit.vit import _create_vit_model
  File "/root/conda/envs/colossalai/lib/python3.8/site-packages/titans/__init__.py", line 3, in <module>
    from . import model
  File "/root/conda/envs/colossalai/lib/python3.8/site-packages/titans/model/__init__.py", line 2, in <module>
    from . import gpt
  File "/root/conda/envs/colossalai/lib/python3.8/site-packages/titans/model/gpt/__init__.py", line 1, in <module>
    from .gpt import *
  File "/root/conda/envs/colossalai/lib/python3.8/site-packages/titans/model/gpt/gpt.py", line 6, in <module>
    from colossalai.builder.pipeline import partition_uniform
ModuleNotFoundError: No module named 'colossalai.builder.pipeline'
Even after I solved this problem, I got another problem from titans:
Traceback (most recent call last):
  File "train_with_cifar10.py", line 119, in <module>
    main()
  File "train_with_cifar10.py", line 54, in main
    model = _create_vit_model(**model_kwargs)
  File "/root/conda/envs/colossalai/lib/python3.8/site-packages/titans/model/vit/vit.py", line 103, in _create_vit_model
    model = VisionTransformer(**model_kwargs)
  File "/root/conda/envs/colossalai/lib/python3.8/site-packages/colossalai/utils/model/utils.py", line 52, in wrapper
    f(module, *args, **kwargs)
  File "/root/conda/envs/colossalai/lib/python3.8/site-packages/titans/decorator/no_support.py", line 57, in new_init
    origin_init(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'hidden_size'
I think your problem will be resolved by pulling the latest code of Titans as well. Sorry about the unstable APIs; we will improve related issues in future releases.
Thanks, Liu. The problem was solved by reinstalling titans. But the training process gets stuck at step 86/196.
I used 4 A6000 GPUs with colossalai run --nproc_per_node 4 train_with_cifar10.py --config configs/vit_1d_tp2_pp2.py
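For context, and purely as an assumption from the file name rather than the actual contents of configs/vit_1d_tp2_pp2.py: such a config presumably requests 2-way 1D tensor parallelism plus 2 pipeline stages, which is why 4 GPUs (2 x 2) are launched and why pipeline scheduling and inter-stage p2p communication are involved at all. In ColossalAI's config style that would look roughly like:

```python
# Hypothetical sketch of what a "1d tp2 pp2" config could contain; the real
# configs/vit_1d_tp2_pp2.py may use different names and extra fields.
parallel = dict(
    pipeline=2,                      # 2 pipeline stages
    tensor=dict(size=2, mode='1d'),  # 2-way 1D tensor parallelism
)
```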
Hi @edwardhorp, thank you for your feedback. We have located the cause and are working on it. We will let you know once it is fixed!
The reason the training process gets stuck is that different pipeline stages got different overflow statuses; if the overflowing rank does not join the gradient-norm clipping, the all-reduce will hang. This bug has been fixed in PR https://github.com/hpcaitech/ColossalAI/pull/1175.
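For anyone who hits the same hang, here is a hypothetical sketch of the pattern being described, not ColossalAI's actual code or the exact change in the PR: if the rank that detected an overflow returns early, the other ranks wait in all_reduce forever; the safe pattern has every rank join the collectives and agree on the overflow status first.

```python
# Hypothetical sketch of the deadlock pattern described above; not the actual
# ColossalAI implementation or the exact change made in PR #1175.
import torch
import torch.distributed as dist

def clip_grad_norm_buggy(grad_norm_sq: torch.Tensor, local_overflow: bool):
    # Bug pattern: the overflowing rank returns early and never enters the
    # collective, so the remaining ranks block forever inside all_reduce.
    if local_overflow:
        return None
    dist.all_reduce(grad_norm_sq, op=dist.ReduceOp.SUM)
    return grad_norm_sq.sqrt()

def clip_grad_norm_safe(grad_norm_sq: torch.Tensor, local_overflow: bool):
    # Safe pattern: every rank joins every collective; the overflow flag is
    # all-reduced first so all ranks take the same branch afterwards.
    overflow = torch.tensor(
        [1.0 if local_overflow else 0.0], device=grad_norm_sq.device)
    dist.all_reduce(overflow, op=dist.ReduceOp.MAX)
    dist.all_reduce(grad_norm_sq, op=dist.ReduceOp.SUM)
    if overflow.item() > 0:
        return None  # every rank skips the update together
    return grad_norm_sq.sqrt()
```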