[BUG]: `please init model in the ColoInitContext` when using `"colossalai"` training strategy in Lightning AI
🐛 Describe the bug
When training a model with the `colossalai` strategy in PyTorch Lightning, training fails with an assertion: `AssertionError: please init model in the ColoInitContext`.
This occurs in code similar to the following:
```python
import pytorch_lightning as pl
import torch
import torch.utils.data
from torch.utils.data import DataLoader

# Create model
model = ModelModel(modelConfig)

# Prepare dataset
dataset = DatasetDataset(datasetConfig)
training_set, validation_set = torch.utils.data.random_split(
    dataset, [int(len(dataset) * 0.8), len(dataset) - int(len(dataset) * 0.8)])

# Filter out bad entries in advance
train_loader = DataLoader(training_set, batch_size=1, collate_fn=dirty_collate)
val_loader = DataLoader(validation_set, batch_size=1, collate_fn=dirty_collate)

# Define trainer
trainer = pl.Trainer(
    max_steps=1,
    accelerator=device,
    strategy="colossalai",
    precision=16,
    limit_train_batches=0.5,
    accumulate_grad_batches=1)

# Start training loop
while True:
    trainer.fit(model, train_loader, val_loader)
```
The following is the stack trace produced by running a similar script.
```
Traceback (most recent call last):
  File "/mnt/e/Source/train.py", line 71, in train
    trainer.fit(model, train_loader, val_loader)
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 88, in launch
    return function(*args, **kwargs)
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1093, in _run
    self.strategy.setup(self)
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/pytorch_lightning/strategies/colossalai.py", line 339, in setup
    self.setup_precision_plugin()
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/pytorch_lightning/strategies/colossalai.py", line 278, in setup_precision_plugin
    self.model = GeminiDDP(
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/colossalai/nn/parallel/gemini_parallel.py", line 56, in __init__
    chunk_manager = init_chunk_manager(model=module,
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/colossalai/gemini/chunk/utils.py", line 32, in init_chunk_manager
    config_dict, total_size, wasted_size = search_chunk_configuration(model, **kwargs)
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/colossalai/gemini/chunk/search_utils.py", line 121, in search_chunk_configuration
    params_dict = classify_params_by_dp_degree(param_order, strict_ddp_flag)
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/colossalai/gemini/chunk/search_utils.py", line 70, in classify_params_by_dp_degree
    assert isinstance(param, ColoParameter), "please init model in the ColoInitContext"
AssertionError: please init model in the ColoInitContext
```
This issue is a repost of https://github.com/Lightning-AI/lightning/issues/16824.
Environment
Current environment
* CUDA:
- GPU:
- NVIDIA GeForce RTX 3090
- available: True
- version: 11.7
* Lightning:
- lightning-utilities: 0.6.0.post0
- open-clip-torch: 2.14.0
- pytorch-lightning: 1.9.2
- torch: 1.13.1
- torchaudio: 0.13.1
- torchmetrics: 0.11.1
- torchvision: 0.14.1
* Packages:
- absl-py: 1.4.0
- aiohttp: 3.8.4
- aiosignal: 1.3.1
- async-timeout: 4.0.2
- attrs: 22.2.0
- bcrypt: 4.0.1
- cachetools: 5.3.0
- certifi: 2022.12.7
- cffi: 1.15.1
- cfgv: 3.3.1
- charset-normalizer: 3.0.1
- click: 8.1.3
- cmake: 3.25.2
- colossalai: 0.2.5
- contexttimer: 0.3.3
- contourpy: 1.0.7
- cryptography: 39.0.1
- cycler: 0.11.0
- diffusers: 0.13.1
- distlib: 0.3.6
- fabric: 3.0.0
- filelock: 3.9.0
- fonttools: 4.38.0
- frozenlist: 1.3.3
- fsspec: 2023.1.0
- ftfy: 6.1.1
- huggingface-hub: 0.12.1
- identify: 2.5.18
- idna: 3.4
- importlib-metadata: 6.0.0
- invoke: 2.0.0
- kiwisolver: 1.4.4
- lightning-utilities: 0.6.0.post0
- lit: 15.0.7
- markdown: 3.4.1
- markdown-it-py: 2.1.0
- markupsafe: 2.1.2
- matplotlib: 3.7.0
- mdurl: 0.1.2
- multidict: 6.0.4
- mypy-extensions: 1.0.0
- ninja: 1.11.1
- nodeenv: 1.7.0
- numpy: 1.24.2
- nvidia-cublas-cu11: 11.10.3.66
- nvidia-cuda-nvrtc-cu11: 11.7.99
- nvidia-cuda-runtime-cu11: 11.7.99
- nvidia-cudnn-cu11: 8.5.0.96
- oauthlib: 3.2.2
- open-clip-torch: 2.14.0
- packaging: 23.0
- paramiko: 3.0.0
- pillow: 9.4.0
- pip: 22.3.1
- platformdirs: 3.0.0
- pre-commit: 3.0.4
- protobuf: 3.20.3
- psutil: 5.9.4
- pyasn1: 0.4.8
- pyasn1-modules: 0.2.8
- pycparser: 2.21
- pygments: 2.14.0
- pynacl: 1.5.0
- pyparsing: 3.0.9
- pyre-extensions: 0.0.23
- python-dateutil: 2.8.2
- pytorch-lightning: 1.9.2
- pyyaml: 6.0
- regex: 2022.10.31
- requests: 2.28.2
- requests-oauthlib: 1.3.1
- rich: 13.3.1
- rsa: 4.9
- sentencepiece: 0.1.97
- setuptools: 65.6.3
- six: 1.16.0
- tensorboard: 2.12.0
- tensorboard-data-server: 0.7.0
- tensorboard-plugin-wit: 1.8.1
- timm: 0.6.12
- tokenizers: 0.13.2
- torch: 1.13.1
- torchaudio: 0.13.1
- torchmetrics: 0.11.1
- torchvision: 0.14.1
- tqdm: 4.64.1
- transformers: 4.26.1
- triton: 2.0.0a2
- typing-extensions: 4.5.0
- typing-inspect: 0.8.0
- urllib3: 1.26.14
- virtualenv: 20.19.0
- wcwidth: 0.2.6
- werkzeug: 2.2.3
- wheel: 0.38.4
- xformers: 0.0.16
- yarl: 1.8.2
- zipp: 3.14.0
* System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.10.9
Can you try following the example here and instantiating your model under the ColoInitContext? The `colossalai` strategy requires your model's parameters to be ColoParameters.
Thank you for your response, @JThh! I've gone ahead and initialized the model under a ColoInitContext as follows.
```python
import pytorch_lightning as pl
import torch
import torch.utils.data
from torch.utils.data import DataLoader
from colossalai.utils import get_current_device
from colossalai.utils.model.colo_init_context import ColoInitContext

# Create model inside the ColoInitContext
init_dev = get_current_device()
with ColoInitContext(device=init_dev, dtype=torch.half):
    model = ModelModel(modelConfig)

# Prepare dataset
dataset = DatasetDataset(datasetConfig)
training_set, validation_set = torch.utils.data.random_split(
    dataset, [int(len(dataset) * 0.8), len(dataset) - int(len(dataset) * 0.8)])

# Filter out bad entries in advance
train_loader = DataLoader(training_set, batch_size=1, collate_fn=dirty_collate)
val_loader = DataLoader(validation_set, batch_size=1, collate_fn=dirty_collate)

# Define trainer
trainer = pl.Trainer(
    max_steps=1,
    accelerator=device,
    strategy="colossalai",
    precision=16,
    limit_train_batches=0.5,
    accumulate_grad_batches=1)

# Start training loop
while True:
    trainer.fit(model, train_loader, val_loader)
But I found myself stuck with another error, this time regarding `dp_rank_list`. I'm unsure whether this is related to the Lightning integration; I can open a new issue for atomicity if needed.
```
  File "/mnt/e/Source/train.py", line 76
    trainer.fit(model, train_loader, val_loader)
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 88, in launch
    return function(*args, **kwargs)
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1093, in _run
    self.strategy.setup(self)
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/pytorch_lightning/strategies/colossalai.py", line 339, in setup
    self.setup_precision_plugin()
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/pytorch_lightning/strategies/colossalai.py", line 278, in setup_precision_plugin
    self.model = GeminiDDP(
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/colossalai/nn/parallel/gemini_parallel.py", line 56, in __init__
    chunk_manager = init_chunk_manager(model=module,
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/colossalai/gemini/chunk/utils.py", line 32, in init_chunk_manager
    config_dict, total_size, wasted_size = search_chunk_configuration(model, **kwargs)
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/colossalai/gemini/chunk/search_utils.py", line 121, in search_chunk_configuration
    params_dict = classify_params_by_dp_degree(param_order, strict_ddp_flag)
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/colossalai/gemini/chunk/search_utils.py", line 77, in classify_params_by_dp_degree
    param_key = param.process_group.dp_world_size()
  File "/home/user/miniconda3/envs/env/lib/python3.10/site-packages/colossalai/tensor/process_group.py", line 243, in dp_world_size
    return len(self._dp_rank_list)
AttributeError: 'ProcessGroup' object has no attribute '_dp_rank_list'. Did you mean: 'dp_rank_list'?
```
I actually have the same issue: first the `AssertionError: please init model in the ColoInitContext`, then, after following the suggestion above, the `AttributeError: 'ProcessGroup' object has no attribute '_dp_rank_list'. Did you mean: 'dp_rank_list'?`.
In Lightning, each module should be constructed under `configure_sharded_model` rather than in the `__init__` constructor, both for the colossalai strategy and for other multi-accelerator training regimes, as advised in https://github.com/Lightning-AI/lightning/issues/16824.
Refer to https://github.com/hpcaitech/ColossalAI/blob/dca98937f834f5af2730f481bf6f5e5eee844742/examples/images/diffusion/ldm/models/diffusion/ddpm.py#L448
Solved in #2909. Thanks.