
[BUG]: ZeRO cannot load pretrained weights

Open bobo0810 opened this issue 2 years ago • 3 comments

🐛 Describe the bug

Example code: https://github.com/hpcaitech/ColossalAI/blob/v0.2.5/tests/test_zero/test_zero_engine.py

resnet18(num_classes=10) ✅
resnet18(num_classes=10, pretrained=True) ❌

File "/opt/conda/lib/python3.8/site-packages/torchvision/models/resnet.py", line 309, in resnet18
    return _resnet("resnet18", BasicBlock, [2, 2, 2, 2], pretrained, progress, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torchvision/models/resnet.py", line 297, in _resnet
    model.load_state_dict(state_dict)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1476, in load_state_dict
    load(self)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1474, in load
    load(child, prefix + name + '.')
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1470, in load
    module._load_from_state_dict(
  File "/opt/conda/lib/python3.8/site-packages/colossalai/zero/sharded_model/sharded_model_v2.py", line 509, in _colo_load_from_state_dict
    param.colo_attr.data_payload_reset(
  File "/opt/conda/lib/python3.8/site-packages/colossalai/zero/sharded_param/sharded_param.py", line 67, in data_payload_reset
    assert tensor.requires_grad is False
AssertionError
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 50403) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0a0+17540c5', 'console_scripts', 'torchrun')())
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

Environment

colossalai 0.2.5
torch 1.11.0a0+17540c5
nvidia-dali-cuda110 1.10.0
CUDA Version: 11.6
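The assertion fires inside ColossalAI's sharded-parameter load hook, which requires that every incoming tensor has `requires_grad is False`. One common workaround (a plain-PyTorch sketch, not the ColossalAI API; the `load_pretrained` helper is hypothetical) is to build the model without `pretrained=True` and load the checkpoint's state dict manually. Tensors returned by `state_dict()` are detached by default, so they satisfy exactly the check that the traceback shows failing:

```python
import torch
import torch.nn as nn

def load_pretrained(model: nn.Module, state_dict: dict) -> nn.Module:
    """Copy pretrained weights into a freshly constructed model.

    state_dict() returns detached tensors by default (keep_vars=False),
    so each tensor has requires_grad == False -- the same condition the
    sharded-parameter hook asserts in the traceback above.
    """
    for name, tensor in state_dict.items():
        # Same invariant as colossalai's data_payload_reset assertion.
        assert tensor.requires_grad is False, name
    with torch.no_grad():  # no autograd tracking during the copy
        model.load_state_dict(state_dict)
    return model

# Usage: simulate a "pretrained" checkpoint with a second model instance.
src = nn.Linear(4, 2)
dst = nn.Linear(4, 2)
load_pretrained(dst, src.state_dict())
```

Whether this interacts correctly with ZeRO's init context in colossalai 0.2.5 is untested here; it only illustrates why a detached state dict passes the `requires_grad is False` check.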

bobo0810 avatar Mar 07 '23 07:03 bobo0810

Bot detected the issue body's language is not English, translate it automatically.


Title: [BUG]: ZeRO cannot use pre-trained weights

Issues-translate-bot avatar Mar 07 '23 07:03 Issues-translate-bot

I believe it is only a test script and might not be intended to be fully functional. Can you try this example to test out ZeRO?

JThh avatar Mar 09 '23 03:03 JThh

> I believe it is only a test script and might not be intended to be fully functional. Can you try this example to test out ZeRO?

Thanks

bobo0810 avatar Mar 09 '23 10:03 bobo0810

This issue was closed due to inactivity. Thanks.

binmakeswell avatar Apr 27 '23 10:04 binmakeswell