ColossalAI
[BUG]: ZeRO cannot load pretrained weights
🐛 Describe the bug
Example code: https://github.com/hpcaitech/ColossalAI/blob/v0.2.5/tests/test_zero/test_zero_engine.py
resnet18(num_classes=10) ✅
resnet18(num_classes=10, pretrained=True) ❌
File "/opt/conda/lib/python3.8/site-packages/torchvision/models/resnet.py", line 309, in resnet18
return _resnet("resnet18", BasicBlock, [2, 2, 2, 2], pretrained, progress, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torchvision/models/resnet.py", line 297, in _resnet
model.load_state_dict(state_dict)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1476, in load_state_dict
load(self)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1474, in load
load(child, prefix + name + '.')
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1470, in load
module._load_from_state_dict(
File "/opt/conda/lib/python3.8/site-packages/colossalai/zero/sharded_model/sharded_model_v2.py", line 509, in _colo_load_from_state_dict
param.colo_attr.data_payload_reset(
File "/opt/conda/lib/python3.8/site-packages/colossalai/zero/sharded_param/sharded_param.py", line 67, in data_payload_reset
assert tensor.requires_grad is False
AssertionError
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 50403) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.11.0a0+17540c5', 'console_scripts', 'torchrun')())
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
run(args)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
Environment
colossalai 0.2.5, torch 1.11.0a0+17540c5, nvidia-dali-cuda110 1.10.0, CUDA Version: 11.6
I believe it is only a test script and might not be intended to be fully functional. Can you try this example to test out ZeRO?
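One possible workaround, sketched below but not verified against ColossalAI itself: the assertion at `sharded_param.py:67` rejects state-dict tensors with `requires_grad=True`, so detaching every tensor before calling `load_state_dict` may avoid it. `detached_state_dict` is a hypothetical helper, and the `nn.Linear` stands in for the ZeRO-wrapped resnet18:

```python
# Workaround sketch (assumption, not tested with ColossalAI): detach all
# state-dict tensors so none of them has requires_grad=True when the
# sharded model's _colo_load_from_state_dict checks them.
import torch
import torch.nn as nn

def detached_state_dict(state_dict):
    """Copy a state dict so that no entry requires grad."""
    return {k: v.detach().clone() for k, v in state_dict.items()}

# Stand-in model; with ColossalAI this would be the ZeRO-wrapped resnet18.
model = nn.Linear(4, 2)

# Simulate pretrained weights that (unexpectedly) require grad.
pretrained = {k: v.clone().requires_grad_(True)
              for k, v in model.state_dict().items()}

# Loading the detached copy avoids handing grad-tracking tensors to the model.
model.load_state_dict(detached_state_dict(pretrained))
```

Whether this helps depends on where torchvision's internal `load_state_dict` call runs; loading the pretrained weights after ZeRO initialization, instead of inside the model constructor via `pretrained=True`, might also sidestep the assertion.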
Thanks
This issue was closed due to inactivity. Thanks.