ColossalAI
[BUG]: ImportError when importing "coati" while running "sh train_sft.sh" in "ColossalAI/applications/Chat/examples"
🐛 Describe the bug
ColossalAI/applications/Chat/examples$ sh train_sft.sh
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
/home/kk_199/anaconda3/envs/colossai/lib/python3.8/site-packages/torch/library.py:130: UserWarning: Overriding a previously registered kernel for the same operator and the same dispatch key
operator: aten::index.Tensor(Tensor self, Tensor?[] indices) -> Tensor
registered at aten/src/ATen/RegisterSchema.cpp:6
dispatch key: Meta
previous kernel: registered at ../aten/src/ATen/functorch/BatchRulesScatterOps.cpp:1053
new kernel: registered at /dev/null:219 (Triggered internally at ../aten/src/ATen/core/dispatch/OperatorEntry.cpp:150.)
self.m.impl(name, dispatch_key, fn)
/home/kk_199/anaconda3/envs/colossai/lib/python3.8/site-packages/torch/amp/autocast_mode.py:202: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
warnings.warn('User provided device_type of 'cuda', but CUDA is not available. Disabling')
Traceback (most recent call last):
  File "train_sft.py", line 13, in <module>
    from coati.trainer import SFTTrainer
  File "/home/kk_199/anaconda3/envs/colossai/lib/python3.8/site-packages/coati/trainer/__init__.py", line 1, in <module>
    from .base import Trainer
  File "/home/kk_199/anaconda3/envs/colossai/lib/python3.8/site-packages/coati/trainer/base.py", line 7, in <module>
    from .callbacks import Callback
  File "/home/kk_199/anaconda3/envs/colossai/lib/python3.8/site-packages/coati/trainer/callbacks/__init__.py", line 3, in <module>
    from .save_checkpoint import SaveCheckpoint
  File "/home/kk_199/anaconda3/envs/colossai/lib/python3.8/site-packages/coati/trainer/callbacks/save_checkpoint.py", line 4, in <module>
    from coati.trainer.strategies import ColossalAIStrategy, Strategy
  File "/home/kk_199/anaconda3/envs/colossai/lib/python3.8/site-packages/coati/trainer/strategies/__init__.py", line 2, in <module>
    from .colossalai import ColossalAIStrategy
  File "/home/kk_199/anaconda3/envs/colossai/lib/python3.8/site-packages/coati/trainer/strategies/colossalai.py", line 19, in <module>
    from colossalai.zero import ColoInitContext, ZeroDDP, zero_model_wrapper, zero_optim_wrapper
ImportError: cannot import name 'ColoInitContext' from 'colossalai.zero' (/home/kk_199/anaconda3/envs/colossai/lib/python3.8/site-packages/colossalai/zero/__init__.py)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 24269) of binary: /home/kk_199/anaconda3/envs/colossai/bin/python
Traceback (most recent call last):
  File "/home/kk_199/anaconda3/envs/colossai/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/kk_199/anaconda3/envs/colossai/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/kk_199/anaconda3/envs/colossai/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/kk_199/anaconda3/envs/colossai/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/kk_199/anaconda3/envs/colossai/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/kk_199/anaconda3/envs/colossai/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_sft.py FAILED
Failures:
[1]:
  time       : 2023-04-20_07:09:53
  host       : LAPTOP-G037B735.localdomain
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 24270)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time       : 2023-04-20_07:09:53
  host       : LAPTOP-G037B735.localdomain
  rank       : 2 (local_rank: 2)
  exitcode   : 1 (pid: 24272)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time       : 2023-04-20_07:09:53
  host       : LAPTOP-G037B735.localdomain
  rank       : 3 (local_rank: 3)
  exitcode   : 1 (pid: 24274)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
  time       : 2023-04-20_07:09:53
  host       : LAPTOP-G037B735.localdomain
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 24269)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Environment
aiohttp 3.8.4
aiosignal 1.3.1
anyio 3.6.2
appdirs 1.4.4
async-timeout 4.0.2
attrs 23.1.0
bcrypt 4.0.1
blessed 1.20.0
Bottleneck 1.3.5
certifi 2022.12.7
cffi 1.15.1
cfgv 3.3.1
charset-normalizer 3.1.0
chatgpt 2.2212.0
click 8.1.3
cmake 3.26.1
coati 1.0.0
colossalai 0.2.7
contexttimer 0.3.3
cryptography 40.0.1
dataclasses-json 0.5.7
datasets 2.11.0
dill 0.3.6
distlib 0.3.6
docker-pycreds 0.4.0
fabric 3.0.0
fastapi 0.95.1
filelock 3.9.0
frozenlist 1.3.3
fsspec 2023.4.0
gitdb 4.0.10
GitPython 3.1.31
gpustat 1.1
greenlet 2.0.2
huggingface-hub 0.13.3
identify 2.5.22
idna 3.4
invoke 2.0.0
Jinja2 3.1.2
langchain 0.0.144
lit 16.0.0
loralib 0.1.1
markdown-it-py 2.2.0
MarkupSafe 2.1.2
marshmallow 3.19.0
marshmallow-enum 1.5.1
mdurl 0.1.2
mkl-fft 1.3.1
mkl-random 1.2.2
mkl-service 2.4.0
mpmath 1.3.0
multidict 6.0.4
multiprocess 0.70.14
mypy-extensions 1.0.0
networkx 3.0
ninja 1.11.1
nodeenv 1.7.0
numexpr 2.8.4
numpy 1.23.5
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.2.10.91
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusparse-cu11 11.7.4.91
nvidia-ml-py 11.525.112
nvidia-nccl-cu11 2.14.3
nvidia-nvtx-cu11 11.7.91
openapi-schema-pydantic 1.2.4
packaging 22.0
pandas 1.5.3
paramiko 3.1.0
pathtools 0.1.2
Pillow 9.4.0
pip 23.0.1
platformdirs 2.5.2
pre-commit 3.2.1
protobuf 4.22.3
psutil 5.9.4
pyarrow 11.0.0
pycparser 2.21
pydantic 1.10.7
Pygments 2.14.0
PyNaCl 1.5.0
python-dateutil 2.8.2
pytz 2022.7
PyYAML 6.0
regex 2023.3.23
requests 2.28.2
responses 0.18.0
rich 13.3.3
sentencepiece 0.1.98
sentry-sdk 1.20.0
setproctitle 1.3.2
setuptools 65.6.3
six 1.16.0
smmap 5.0.0
sniffio 1.3.0
SQLAlchemy 1.4.47
sse-starlette 1.3.4
starlette 0.26.1
sympy 1.11.1
tenacity 8.2.2
tls-client 0.1.9
tokenizers 0.13.2
torch 1.13.1
torchaudio 0.13.1
torchvision 0.14.1
tqdm 4.65.0
transformers 4.28.0.dev0
triton 2.0.0
typing_extensions 4.5.0
typing-inspect 0.8.0
urllib3 1.26.15
virtualenv 20.17.1
wandb 0.15.0
wcwidth 0.2.6
wheel 0.38.4
xxhash 3.2.0
yarl 1.8.2
Me too.
Issue #3447 is related; the fix discussed there may solve this.
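For anyone hitting the same ImportError, a small diagnostic like the one below may help narrow it down. It is only a sketch, assuming the failure comes from the installed colossalai (0.2.7 in the environment listed above) not exporting the names that the installed coati tries to import from `colossalai.zero`. The helper names `missing_names` and `installed_version` are made up for this example and are not part of either package. The `hasattr` check is an approximation of what `from module import name` does.

```python
import importlib
from importlib import metadata


def missing_names(module_name, names):
    """Return the subset of `names` that `module_name` does not expose."""
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Names copied from the failing import line in the traceback above.
    required = ["ColoInitContext", "ZeroDDP", "zero_model_wrapper", "zero_optim_wrapper"]
    print("colossalai:", installed_version("colossalai"))
    print("coati:", installed_version("coati"))
    try:
        print("missing from colossalai.zero:", missing_names("colossalai.zero", required))
    except ImportError as exc:
        # Covers the case where colossalai itself is not importable.
        print("could not import colossalai.zero:", exc)
```

If the list of missing names is non-empty, the two packages were most likely installed from different revisions, and reinstalling both from the same checkout would be the first thing to try.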