[BUG] Runtime error when enabling fp16 config with tasks using SyncBatchNorm

Open sangsang96 opened this issue 3 years ago • 0 comments

Describe the bug
The following runtime error occurs when enabling the fp16 config for a task that uses model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model) and runs on multiple GPUs.

Is there another way of making my batchnorms synchronized when enabling fp16 config?
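
For reference, a minimal sketch of the setup described above. The `TinyNet` model is a hypothetical stand-in for the real image encoder (RepVGG in the traceback), and the config shows only the `fp16` switch; every other DeepSpeed setting is omitted, so this is an illustration rather than the actual training script.

```python
import torch
import torch.nn as nn


# Hypothetical stand-in for the real image encoder; not the model from this report.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(8)

    def forward(self, x):
        return torch.relu(self.bn(self.conv(x)))


model = TinyNet()

# Replace every BatchNorm*d layer in the model with SyncBatchNorm.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# DeepSpeed config fragment that turns on fp16 (all other keys omitted here).
ds_config = {"fp16": {"enabled": True}}

print(type(model.bn).__name__)  # SyncBatchNorm
```

The conversion itself does not need a GPU or an initialized process group; only the forward pass of the converted layers does.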

[2022-07-18 14:49:25,373 non_mp_trainer.py INFO]: ***** Running training *****
[2022-07-18 14:49:25,373 non_mp_trainer.py INFO]:   Num examples = 889534
[2022-07-18 14:49:25,373 non_mp_trainer.py INFO]:   Num Epochs = 60
[2022-07-18 14:49:25,373 non_mp_trainer.py INFO]:   Instantaneous batch size per device = 96
[2022-07-18 14:49:25,373 non_mp_trainer.py INFO]:   Total train batch size (w. parallel, distributed & accumulation) = 192
[2022-07-18 14:49:25,373 non_mp_trainer.py INFO]:   Gradient Accumulation steps = 1
[2022-07-18 14:49:25,373 non_mp_trainer.py INFO]:   Total optimization steps = 277920
  0%|          | 0/277920 [00:00<?, ?it/s]start training epoch: 0
/usr/local/anaconda3/lib/python3.7/site-packages/PIL/Image.py:989: UserWarning: Palette images with Transparency expressed in bytes should be converted to RGBA images
  "Palette images with Transparency expressed in bytes should be "
Traceback (most recent call last):
  File "train.py", line 117, in <module>
    trainer.train()
  File "/media/cfs/nlp/vlp/magnus-models/vlp/magnus-api/magnus_api/magnus_trainer.py", line 109, in train
    return self._trainer.train()
  File "/media/cfs/nlp/vlp/magnus-models/vlp/magnus-core/magnus_core/trainer/non_mp_trainer.py", line 809, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1891, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1923, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/anaconda3/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1606, in forward
    loss = self.module(*inputs, **kwargs)
  File "/usr/local/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/nlp/vlp/magnus-models/vlp/custom_model.py", line 271, in forward
    img_embeds = self.image_encoder(image).transpose(-1,-2)
  File "/usr/local/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/nlp/vlp/magnus-models/vlp/models/RepVGG/repvgg.py", line 181, in forward
    out = self.stage0(x)
  File "/usr/local/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/nlp/vlp/magnus-models/vlp/models/RepVGG/repvgg.py", line 55, in forward
    return self.nonlinearity(self.se(self.rbr_dense(inputs) + self.rbr_1x1(inputs) + id_out))
  File "/usr/local/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/usr/local/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/anaconda3/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 758, in forward
    world_size,
  File "/usr/local/anaconda3/lib/python3.7/site-packages/torch/nn/modules/_functions.py", line 65, in forward
    count_all.view(-1)
RuntimeError: Expected counts to have type Half but got Float
Traceback (most recent call last):
  File "train.py", line 117, in <module>
    trainer.train()
  File "/media/cfs/nlp/vlp/magnus-models/vlp/magnus-api/magnus_api/magnus_trainer.py", line 109, in train
    return self._trainer.train()
  File "/media/cfs/nlp/vlp/magnus-models/vlp/magnus-core/magnus_core/trainer/non_mp_trainer.py", line 809, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1891, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1923, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/anaconda3/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1606, in forward
    loss = self.module(*inputs, **kwargs)
  File "/usr/local/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/nlp/vlp/magnus-models/vlp/custom_model.py", line 271, in forward
    img_embeds = self.image_encoder(image).transpose(-1,-2)
  File "/usr/local/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/nlp/vlp/magnus-models/vlp/models/RepVGG/repvgg.py", line 181, in forward
    out = self.stage0(x)
  File "/usr/local/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/nlp/vlp/magnus-models/vlp/models/RepVGG/repvgg.py", line 55, in forward
    return self.nonlinearity(self.se(self.rbr_dense(inputs) + self.rbr_1x1(inputs) + id_out))
  File "/usr/local/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/usr/local/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/anaconda3/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 758, in forward
    world_size,
  File "/usr/local/anaconda3/lib/python3.7/site-packages/torch/nn/modules/_functions.py", line 65, in forward
    count_all.view(-1)
RuntimeError: Expected counts to have type Half but got Float
  0%|          | 0/277920 [00:06<?, ?it/s]
[2022-07-18 14:49:34,803] [INFO] [launch.py:131:sigkill_handler] Killing subprocess 10096
[2022-07-18 14:49:34,804] [INFO] [launch.py:131:sigkill_handler] Killing subprocess 10097
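
The `Half` side of the mismatch can be inspected outside of training. The sketch below assumes that enabling fp16 results in the module being cast with `.half()`; that is an assumption about what happens here, not something confirmed in this report. It only prints the dtypes of a SyncBatchNorm layer's tensors after such a cast and does not reproduce the error itself.

```python
import torch
import torch.nn as nn

# A standalone SyncBatchNorm layer; constructing it needs no process group.
bn = nn.SyncBatchNorm(8)

# Assumption: the fp16 config leads to a cast like this one.
bn = bn.half()

# Floating-point parameters and buffers follow the cast;
# the integer batch counter keeps its original dtype.
print(bn.weight.dtype)               # torch.float16
print(bn.running_mean.dtype)         # torch.float16
print(bn.num_batches_tracked.dtype)  # torch.int64
```

Checking these dtypes on the real model just before the failing forward call would show which tensors are in half precision at that point.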

sangsang96, Jul 18 '22 07:07