
[BUG]: metaclass conflict

Open · kkangjiawei opened this issue 1 year ago · 1 comment

🐛 Describe the bug

colossalai reports a metaclass conflict when used with pytorch 1.13.1.

Command: CUDA_VISIBLE_DEVICES=2 torchrun --standalone --nproc_per_node=1 pretrain.py

log:

Traceback (most recent call last):
  File "pretrain.py", line 66, in <module>
    from strategies import DDPStrategy, NaiveStrategy, ColossalAIStrategy
  File "/home/kangjiawei/work/sscp/strategies/__init__.py", line 2, in <module>
    from .colossalai import ColossalAIStrategy
  File "/home/kangjiawei/work/sscp/strategies/colossalai.py", line 14, in <module>
    import colossalai
  File "/home/kangjiawei/miniconda3/envs/coati/lib/python3.8/site-packages/colossalai/__init__.py", line 1, in <module>
    from .initialize import (
  File "/home/kangjiawei/miniconda3/envs/coati/lib/python3.8/site-packages/colossalai/initialize.py", line 18, in <module>
    from colossalai.amp import AMP_TYPE, convert_to_amp
  File "/home/kangjiawei/miniconda3/envs/coati/lib/python3.8/site-packages/colossalai/amp/__init__.py", line 5, in <module>
    from colossalai.context import Config
  File "/home/kangjiawei/miniconda3/envs/coati/lib/python3.8/site-packages/colossalai/context/__init__.py", line 2, in <module>
    from .parallel_context import ParallelContext
  File "/home/kangjiawei/miniconda3/envs/coati/lib/python3.8/site-packages/colossalai/context/parallel_context.py", line 17, in <module>
    from colossalai.registry import DIST_GROUP_INITIALIZER
  File "/home/kangjiawei/miniconda3/envs/coati/lib/python3.8/site-packages/colossalai/registry/__init__.py", line 1, in <module>
    import torch.distributed.optim as dist_optim
  File "/home/kangjiawei/miniconda3/envs/coati/lib/python3.8/site-packages/torch/distributed/optim/__init__.py", line 28, in <module>
    from .zero_redundancy_optimizer import ZeroRedundancyOptimizer
  File "/home/kangjiawei/miniconda3/envs/coati/lib/python3.8/site-packages/torch/distributed/optim/zero_redundancy_optimizer.py", line 273, in <module>
    class ZeroRedundancyOptimizer(Optimizer, Joinable):
TypeError: metaclass conflict: the metaclass of a derived class must be a (non-strict) subclass of the metaclasses of all its bases
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 76694) of binary: /home/kangjiawei/miniconda3/envs/coati/bin/python
Traceback (most recent call last):
  File "/home/kangjiawei/miniconda3/envs/coati/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/kangjiawei/miniconda3/envs/coati/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/kangjiawei/miniconda3/envs/coati/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/kangjiawei/miniconda3/envs/coati/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/kangjiawei/miniconda3/envs/coati/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/kangjiawei/miniconda3/envs/coati/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
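For context, the TypeError in the log means Python cannot pick a single metaclass for ZeroRedundancyOptimizer, which inherits from both Optimizer and Joinable. A minimal, self-contained sketch of how this kind of conflict arises (illustration only, not ColossalAI or PyTorch code):

```python
# Two unrelated metaclasses: neither is a subclass of the other.
class MetaA(type):
    pass

class MetaB(type):
    pass

class Base1(metaclass=MetaA):
    pass

class Base2(metaclass=MetaB):
    pass

# Combining the bases fails: Python cannot find a metaclass that is a
# (non-strict) subclass of both MetaA and MetaB -- the same TypeError as above.
try:
    class Derived(Base1, Base2):
        pass
except TypeError as err:
    print(err)  # metaclass conflict: the metaclass of a derived class ...
```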

Environment

colossalai 0.2.8
coati 1.0.0
gpustat 1.1
torch 1.13.1
torchdrug 0.2.0
transformers 4.27.4
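If it helps with triage: in a stock torch 1.13.1 install, Optimizer's metaclass is plain type and Joinable's is abc.ABCMeta, which are compatible, so a conflict at this line usually means another installed package has monkey-patched one of the torch base classes before colossalai was imported (torchdrug in this environment is one possible candidate, but that is only a guess). A small diagnostic sketch to check this, under that assumption:

```python
# Diagnostic sketch: print the metaclasses actually in effect. In an unpatched
# torch 1.13.1 install these should be <class 'type'> and <class 'abc.ABCMeta'>.
from torch.optim import Optimizer
from torch.distributed.algorithms.join import Joinable

print(type(Optimizer))  # expected: <class 'type'>
print(type(Joinable))   # expected: <class 'abc.ABCMeta'>
```

Repeating the same two prints after importing the other packages in the environment (e.g. torchdrug) would show whether any of them changes either metaclass; this is an assumption to verify, not a confirmed cause.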

kkangjiawei · Apr 12 '23 05:04

Hi @kkangjiawei, could you please share more information about this problem, such as what kind of task you are performing in pretrain.py?

Camille7777 · Apr 17 '23 08:04