ColossalAI
                        [BUG]: `please init model in the ColoInitContext` when using `"colossalai"` training strategy in Lightning AI
🐛 Describe the bug
When training a model with the colossalai strategy in PyTorch Lightning, training fails with an assertion error: AssertionError: please init model in the ColoInitContext.
This occurs in code similar to the following:
import pytorch_lightning as pl
import torch
import torch.utils.data
from torch.utils.data import DataLoader
# Create model
model = ModelModel(modelConfig)
# Prepare dataset
dataset = DatasetDataset(datasetConfig)
training_set, validation_set = torch.utils.data.random_split(dataset, [int(len(dataset)*0.8), len(dataset) - int(len(dataset)*0.8)])
# Filter out bad entries in advance
train_loader = DataLoader(training_set, batch_size=1, collate_fn=dirty_collate)
val_loader = DataLoader(validation_set, batch_size=1, collate_fn=dirty_collate)
# Define trainer
trainer = pl.Trainer(
    max_steps=1,
    accelerator=device, 
    strategy="colossalai",
    precision=16, 
    limit_train_batches=0.5,
    accumulate_grad_batches=1)
# Start training loop
while True:
    trainer.fit(model, train_loader, val_loader)
The following is the stack trace produced by running a similar script.
Traceback (most recent call last):
  File "/mnt/e/Source/train.py", line 71, in train
    trainer.fit(model, train_loader, val_loader)
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 88, in launch
    return function(*args, **kwargs)
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1093, in _run
    self.strategy.setup(self)
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/pytorch_lightning/strategies/colossalai.py", line 339, in setup
    self.setup_precision_plugin()
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/pytorch_lightning/strategies/colossalai.py", line 278, in setup_precision_plugin
    self.model = GeminiDDP(
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/colossalai/nn/parallel/gemini_parallel.py", line 56, in __init__
    chunk_manager = init_chunk_manager(model=module,
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/colossalai/gemini/chunk/utils.py", line 32, in init_chunk_manager
    config_dict, total_size, wasted_size = search_chunk_configuration(model, **kwargs)
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/colossalai/gemini/chunk/search_utils.py", line 121, in search_chunk_configuration
    params_dict = classify_params_by_dp_degree(param_order, strict_ddp_flag)
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/colossalai/gemini/chunk/search_utils.py", line 70, in classify_params_by_dp_degree
    assert isinstance(param, ColoParameter), "please init model in the ColoInitContext"
AssertionError: please init model in the ColoInitContext
This issue is a repost of https://github.com/Lightning-AI/lightning/issues/16824.
Environment
Current environment
* CUDA:
        - GPU:
                - NVIDIA GeForce RTX 3090
        - available:         True
        - version:           11.7
* Lightning:
        - lightning-utilities: 0.6.0.post0
        - open-clip-torch:   2.14.0
        - pytorch-lightning: 1.9.2
        - torch:             1.13.1
        - torchaudio:        0.13.1
        - torchmetrics:      0.11.1
        - torchvision:       0.14.1
* Packages:
        - absl-py:           1.4.0
        - aiohttp:           3.8.4
        - aiosignal:         1.3.1
        - async-timeout:     4.0.2
        - attrs:             22.2.0
        - bcrypt:            4.0.1
        - cachetools:        5.3.0
        - certifi:           2022.12.7
        - cffi:              1.15.1
        - cfgv:              3.3.1
        - charset-normalizer: 3.0.1
        - click:             8.1.3
        - cmake:             3.25.2
        - colossalai:        0.2.5
        - contexttimer:      0.3.3
        - contourpy:         1.0.7
        - cryptography:      39.0.1
        - cycler:            0.11.0
        - diffusers:         0.13.1
        - distlib:           0.3.6
        - fabric:            3.0.0
        - filelock:          3.9.0
        - fonttools:         4.38.0
        - frozenlist:        1.3.3
        - fsspec:            2023.1.0
        - ftfy:              6.1.1
        - huggingface-hub:   0.12.1
        - identify:          2.5.18
        - idna:              3.4
        - importlib-metadata: 6.0.0
        - invoke:            2.0.0
        - kiwisolver:        1.4.4
        - lightning-utilities: 0.6.0.post0
        - lit:               15.0.7
        - markdown:          3.4.1
        - markdown-it-py:    2.1.0
        - markupsafe:        2.1.2
        - matplotlib:        3.7.0
        - mdurl:             0.1.2
        - multidict:         6.0.4
        - mypy-extensions:   1.0.0
        - ninja:             1.11.1
        - nodeenv:           1.7.0
        - numpy:             1.24.2
        - nvidia-cublas-cu11: 11.10.3.66
        - nvidia-cuda-nvrtc-cu11: 11.7.99
        - nvidia-cuda-runtime-cu11: 11.7.99
        - nvidia-cudnn-cu11: 8.5.0.96
        - oauthlib:          3.2.2
        - open-clip-torch:   2.14.0
        - packaging:         23.0
        - paramiko:          3.0.0
        - pillow:            9.4.0
        - pip:               22.3.1
        - platformdirs:      3.0.0
        - pre-commit:        3.0.4
        - protobuf:          3.20.3
        - psutil:            5.9.4
        - pyasn1:            0.4.8
        - pyasn1-modules:    0.2.8
        - pycparser:         2.21
        - pygments:          2.14.0
        - pynacl:            1.5.0
        - pyparsing:         3.0.9
        - pyre-extensions:   0.0.23
        - python-dateutil:   2.8.2
        - pytorch-lightning: 1.9.2
        - pyyaml:            6.0
        - regex:             2022.10.31
        - requests:          2.28.2
        - requests-oauthlib: 1.3.1
        - rich:              13.3.1
        - rsa:               4.9
        - sentencepiece:     0.1.97
        - setuptools:        65.6.3
        - six:               1.16.0
        - tensorboard:       2.12.0
        - tensorboard-data-server: 0.7.0
        - tensorboard-plugin-wit: 1.8.1
        - timm:              0.6.12
        - tokenizers:        0.13.2
        - torch:             1.13.1
        - torchaudio:        0.13.1
        - torchmetrics:      0.11.1
        - torchvision:       0.14.1
        - tqdm:              4.64.1
        - transformers:      4.26.1
        - triton:            2.0.0a2
        - typing-extensions: 4.5.0
        - typing-inspect:    0.8.0
        - urllib3:           1.26.14
        - virtualenv:        20.19.0
        - wcwidth:           0.2.6
        - werkzeug:          2.2.3
        - wheel:             0.38.4
        - xformers:          0.0.16
        - yarl:              1.8.2
        - zipp:              3.14.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         x86_64
        - python:            3.10.9
Can you try following the example here and initialize your model under the ColoInitContext? The "colossalai" strategy requires your model's parameters to be ColoParameters.
Thank you for your response, @JThh! I've gone ahead and initialized the model under a ColoInitContext as follows.
import pytorch_lightning as pl
import torch
import torch.utils.data
from torch.utils.data import DataLoader
from colossalai.utils import get_current_device
from colossalai.utils.model.colo_init_context import ColoInitContext
# Create model
init_dev = get_current_device()
with ColoInitContext(device=init_dev, dtype=torch.half):
    model = ModelModel(modelConfig)
# Prepare dataset
dataset = DatasetDataset(datasetConfig)
training_set, validation_set = torch.utils.data.random_split(dataset, [int(len(dataset)*0.8), len(dataset) - int(len(dataset)*0.8)])
# Filter out bad entries in advance
train_loader = DataLoader(training_set, batch_size=1, collate_fn=dirty_collate)
val_loader = DataLoader(validation_set, batch_size=1, collate_fn=dirty_collate)
# Define trainer
trainer = pl.Trainer(
    max_steps=1,
    accelerator=device, 
    strategy="colossalai",
    precision=16, 
    limit_train_batches=0.5,
    accumulate_grad_batches=1)
# Start training loop
while True:
    trainer.fit(model, train_loader, val_loader)
However, I'm now stuck on another error, this time regarding dp_rank_list. I'm unsure whether this is related to the Lightning integration, but I can open a new issue for atomicity if needed.
  File "/mnt/e/Source/train.py", line 76
    trainer.fit(model, train_loader, val_loader)
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 88, in launch
    return function(*args, **kwargs)
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1093, in _run
    self.strategy.setup(self)
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/pytorch_lightning/strategies/colossalai.py", line 339, in setup
    self.setup_precision_plugin()
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/pytorch_lightning/strategies/colossalai.py", line 278, in setup_precision_plugin
    self.model = GeminiDDP(
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/colossalai/nn/parallel/gemini_parallel.py", line 56, in __init__
    chunk_manager = init_chunk_manager(model=module,
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/colossalai/gemini/chunk/utils.py", line 32, in init_chunk_manager
    config_dict, total_size, wasted_size = search_chunk_configuration(model, **kwargs)
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/colossalai/gemini/chunk/search_utils.py", line 121, in search_chunk_configuration
    params_dict = classify_params_by_dp_degree(param_order, strict_ddp_flag)
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/colossalai/gemini/chunk/search_utils.py", line 77, in classify_params_by_dp_degree
    param_key = param.process_group.dp_world_size()
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/colossalai/tensor/process_group.py", line 243, in dp_world_size
    return len(self._dp_rank_list)
AttributeError: 'ProcessGroup' object has no attribute '_dp_rank_list'. Did you mean: 'dp_rank_list'?
I actually have the same issue: first the AssertionError: please init model in the ColoInitContext, and then, after following the suggestions, the AttributeError: 'ProcessGroup' object has no attribute '_dp_rank_list'. Did you mean: 'dp_rank_list'?
In Lightning, each module should be constructed in configure_sharded_model rather than in the __init__ constructor, both for the colossalai strategy and for other multi-accelerator training regimes, as advised in https://github.com/Lightning-AI/lightning/issues/16824.
For an example of this pattern, refer to https://github.com/hpcaitech/ColossalAI/blob/dca98937f834f5af2730f481bf6f5e5eee844742/examples/images/diffusion/ldm/models/diffusion/ddpm.py#L448
Solved in #2909. Thanks.