ColossalAI
[BUG]: metaclass conflict
🐛 Describe the bug
ColossalAI reports a metaclass conflict error when used with PyTorch 1.13.1.
Command: CUDA_VISIBLE_DEVICES=2 torchrun --standalone --nproc_per_node=1 pretrain.py
Log:
Traceback (most recent call last):
File "pretrain.py", line 66, in <module>
from strategies import DDPStrategy, NaiveStrategy, ColossalAIStrategy
File "/home/kangjiawei/work/sscp/strategies/__init__.py", line 2, in <module>
from .colossalai import ColossalAIStrategy
File "/home/kangjiawei/work/sscp/strategies/colossalai.py", line 14, in <module>
import colossalai
File "/home/kangjiawei/miniconda3/envs/coati/lib/python3.8/site-packages/colossalai/__init__.py", line 1, in <module>
from .initialize import (
File "/home/kangjiawei/miniconda3/envs/coati/lib/python3.8/site-packages/colossalai/initialize.py", line 18, in <module>
from colossalai.amp import AMP_TYPE, convert_to_amp
File "/home/kangjiawei/miniconda3/envs/coati/lib/python3.8/site-packages/colossalai/amp/__init__.py", line 5, in <module>
from colossalai.context import Config
File "/home/kangjiawei/miniconda3/envs/coati/lib/python3.8/site-packages/colossalai/context/__init__.py", line 2, in <module>
from .parallel_context import ParallelContext
File "/home/kangjiawei/miniconda3/envs/coati/lib/python3.8/site-packages/colossalai/context/parallel_context.py", line 17, in <module>
from colossalai.registry import DIST_GROUP_INITIALIZER
File "/home/kangjiawei/miniconda3/envs/coati/lib/python3.8/site-packages/colossalai/registry/__init__.py", line 1, in <module>
import torch.distributed.optim as dist_optim
File "/home/kangjiawei/miniconda3/envs/coati/lib/python3.8/site-packages/torch/distributed/optim/__init__.py", line 28, in <module>
from .zero_redundancy_optimizer import ZeroRedundancyOptimizer
File "/home/kangjiawei/miniconda3/envs/coati/lib/python3.8/site-packages/torch/distributed/optim/zero_redundancy_optimizer.py", line 273, in <module>
class ZeroRedundancyOptimizer(Optimizer, Joinable):
TypeError: metaclass conflict: the metaclass of a derived class must be a (non-strict) subclass of the metaclasses of all its bases
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 76694) of binary: /home/kangjiawei/miniconda3/envs/coati/bin/python
Traceback (most recent call last):
File "/home/kangjiawei/miniconda3/envs/coati/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/kangjiawei/miniconda3/envs/coati/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/kangjiawei/miniconda3/envs/coati/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/kangjiawei/miniconda3/envs/coati/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/kangjiawei/miniconda3/envs/coati/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/kangjiawei/miniconda3/envs/coati/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
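For reference, the TypeError above is Python's standard complaint when a new class derives from bases whose metaclasses are unrelated. A minimal, self-contained sketch (toy classes only, not ColossalAI or PyTorch code) reproduces the same message:

```python
# Minimal illustration of Python's metaclass-conflict rule
# (toy classes, unrelated to ColossalAI or PyTorch internals).
class MetaA(type):
    pass

class MetaB(type):
    pass

class A(metaclass=MetaA):
    pass

class B(metaclass=MetaB):
    pass

try:
    # Neither MetaA nor MetaB is a subclass of the other, so Python cannot
    # choose a metaclass for C and raises the same TypeError seen in the log.
    class C(A, B):
        pass
except TypeError as exc:
    print(exc)
    # metaclass conflict: the metaclass of a derived class must be a
    # (non-strict) subclass of the metaclasses of all its bases
```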
Environment
colossalai 0.2.8
coati 1.0.0
gpustat 1.1
torch 1.13.1
torchdrug 0.2.0
transformers 4.27.4
Hi @kkangjiawei, could you please share more information about this problem, such as what kind of task you are performing in pretrain.py?
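One thing that may help narrow this down: the failing statement is `class ZeroRedundancyOptimizer(Optimizer, Joinable)`, and `Joinable` is an ABC, so the conflict suggests that `torch.optim.Optimizer` no longer has the plain `type` metaclass at that point. Packages that monkey-patch torch classes on import (torchdrug, which appears in the environment list, is one candidate) could cause this, but that is only an assumption to verify. A minimal diagnostic sketch for the affected environment:

```python
# Hedged diagnostic sketch (an assumption to verify, not a confirmed root
# cause): inspect the metaclasses of the two bases used by the failing
# `class ZeroRedundancyOptimizer(Optimizer, Joinable)` statement. Run it
# after the same imports pretrain.py performs before `import colossalai`,
# so any monkey-patching done by earlier imports is already in effect.
import torch.optim
from torch.distributed.algorithms.join import Joinable

# With an unmodified PyTorch install, Optimizer's metaclass is plain `type`
# and Joinable's is `abc.ABCMeta`, which do not conflict.
print("Optimizer:          ", torch.optim.Optimizer)
print("Optimizer metaclass:", type(torch.optim.Optimizer))
print("Joinable metaclass: ", type(Joinable))

# If the Optimizer metaclass printed above is anything other than
# <class 'type'>, an earlier import has replaced or wrapped it, which would
# produce exactly the TypeError shown in the log when torch defines
# ZeroRedundancyOptimizer.
```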