
Load single-GPU trained parameters to multi-GPU inference (SwinMoE)

Open ranggihwang opened this issue 2 years ago • 1 comment

Hello.

I'm trying to load a single-GPU-trained SwinMoE model for multi-GPU (4-GPU) inference.

I'm using 8 experts per MoE layer.

It seems that my checkpoint file stores all 8 experts' parameters for each MoE layer at once, but with 4 GPUs each GPU only expects 2 experts' parameters, so the shapes don't match.

Is there any way to fix this, or do I just need to re-train on 4 GPUs?

Here's the traceback.

Traceback (most recent call last):
  File "main_moe.py", line 367, in <module>
    main(config)
  File "main_moe.py", line 139, in main
    max_accuracy = load_checkpoint(config, model_without_ddp, optimizer, lr_scheduler, loss_scaler, logger)
  File "/root/Swin-Transformer_ranggi/utils_moe.py", line 44, in load_checkpoint
    msg = model.load_state_dict(checkpoint['model'], strict=False)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1370, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for SwinTransformerMoE:
	size mismatch for layers.2.blocks.1.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([8, 2048, 512]) from checkpoint, the shape in current model is torch.Size([2, 2048, 512]).
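For reference, here is the workaround I'm considering: slicing the expert dimension of each checkpoint tensor down to this rank's local experts before calling load_state_dict. This is only a sketch, not code from the repo; shard_moe_checkpoint is a hypothetical helper, and it assumes the expert tensors are batched along dim 0 (like batched_fc1_w above) and that experts are assigned to ranks contiguously.

import torch.distributed as dist

def shard_moe_checkpoint(state_dict, world_size=None, rank=None):
    # Hypothetical helper: keep only this rank's slice of each batched expert tensor.
    world_size = world_size if world_size is not None else dist.get_world_size()
    rank = rank if rank is not None else dist.get_rank()
    sharded = {}
    for name, param in state_dict.items():
        if '_moe_layer.experts.' in name:
            num_global = param.shape[0]            # e.g. 8 experts in the checkpoint
            num_local = num_global // world_size   # e.g. 2 experts per GPU on 4 GPUs
            start = rank * num_local
            sharded[name] = param[start:start + num_local].clone()
        else:
            sharded[name] = param
    return sharded

# e.g. in utils_moe.load_checkpoint, before the failing call:
# checkpoint['model'] = shard_moe_checkpoint(checkpoint['model'])
# msg = model.load_state_dict(checkpoint['model'], strict=False)

I'm not sure whether this interacts correctly with how Tutel expects the expert parameters to be laid out, so any guidance would be appreciated.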

ranggihwang avatar Oct 16 '22 16:10 ranggihwang