Deepspeed Stage 3 crashes Lightning trainer

Open m-harmonic opened this issue 2 years ago • 2 comments

Bug description

We are using the deepspeed_stage_3 strategy with default deepspeed settings, via the following code:

trainer = lightning.Trainer(
    strategy="deepspeed_stage_3",
    precision="bf16-mixed",
    devices=8,
    num_nodes=1,
)

Running training crashes with an error of the form:

RuntimeError: disagreement between rank0 and rank2: rank0: [-4866073977605047075, -4957833660337767236, 4361942804416314505, -4876770194190910351, 4359269593160498317, 4351670099501268146, 4309166695103937713, -4902805588004029293, -4860302615267493106, 4389949427156860023, 4327322237000694935, -4847777101216203705, 4338298773231877268, 4348574010025524079, -4883946058044031946, 4362928190187388154, 4294669786782055563, -4855517781217919899, 4317751329683684451, ... 4237109328703306542, 4339425420460244029, -4915050039391372737, 4348854430596315870, 4333655295082511627, 4265537715097910320, 4356172371969948135], 
rank2: [2, 3, 4, 6, 34, 7, 8, 10, 12, 14, 17, 18, 20, 22, 24, 28, 27, 35, 29, 30, 33, 32, 31, 36, 64, 37, 38, 40, 42, 44, 47, 48, 50, 52, 54, 58, 57, 65, 59, 60, 63, 62, 61, 66, 94, 67, 68, 70, 72, 74, 77, 78, 80, 82, 84, 88, 87, 95, 89, 90, 93, 92, 91, 96, 124, 97, , ...]
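
For readers unfamiliar with this message: its wording suggests that each rank holds a list of integer IDs and that the list is compared against rank 0's copy. That reading is an assumption drawn only from the message text. The sketch below simulates such a check in plain Python, with no distributed backend; `check_ranks_agree` is a hypothetical helper for illustration, not DeepSpeed's actual code.

```python
from typing import Dict, List


def check_ranks_agree(per_rank_ids: Dict[int, List[int]]) -> None:
    """Raise if any rank's list differs from rank 0's list.

    Hypothetical helper for illustration only; it is not DeepSpeed's code.
    """
    reference = per_rank_ids[0]
    for rank in sorted(per_rank_ids):
        ids = per_rank_ids[rank]
        if ids != reference:
            raise RuntimeError(
                f"disagreement between rank0 and rank{rank}: "
                f"rank0: {reference}, rank{rank}: {ids}"
            )


# Ranks that recorded the same sequence pass silently.
check_ranks_agree({0: [2, 3, 4], 1: [2, 3, 4]})

# A rank with a different sequence triggers the error.
try:
    check_ranks_agree({0: [2, 3, 4], 1: [2, 3, 4], 2: [2, 4, 3]})
except RuntimeError as err:
    message = str(err)
    print(message)
```

In the report above, the rank0 values are very large integers while the rank2 values look like small sequential IDs, so the two lists differ in kind and not only in order.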

No error occurs with deepspeed_stage_2 when all other settings are kept the same. We are looking for suggestions on how to fix this problem, or at least work around it. Has anyone seen this before? Thank you for any help.

The error has also been reported on the Microsoft DeepSpeed GitHub page, with no reply from the developers so far: https://github.com/microsoft/DeepSpeed/issues/1960

What version are you seeing the problem on?

v2.1

How to reproduce the bug

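import lightning

# `model` and `dataset` are the user's own objects; their definitions are not shown in this report.
trainer = lightning.Trainer(
    strategy="deepspeed_stage_3",
    precision="bf16-mixed",
    devices=8,
    num_nodes=1,
)
trainer.fit(model, dataset)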

Error messages and logs

RuntimeError: disagreement between rank0 and rank2: rank0: [-4866073977605047075, -4957833660337767236, 4361942804416314505, -4876770194190910351, 4359269593160498317, 4351670099501268146, 4309166695103937713, -4902805588004029293, -4860302615267493106, 4389949427156860023, 4327322237000694935, -4847777101216203705, 4338298773231877268, 4348574010025524079, -4883946058044031946, 4362928190187388154, 4294669786782055563, -4855517781217919899, 4317751329683684451, ... 4237109328703306542, 4339425420460244029, -4915050039391372737, 4348854430596315870, 4333655295082511627, 4265537715097910320, 4356172371969948135], 
rank2: [2, 3, 4, 6, 34, 7, 8, 10, 12, 14, 17, 18, 20, 22, 24, 28, 27, 35, 29, 30, 33, 32, 31, 36, 64, 37, 38, 40, 42, 44, 47, 48, 50, 52, 54, 58, 57, 65, 59, 60, 63, 62, 61, 66, 94, 67, 68, 70, 72, 74, 77, 78, 80, 82, 84, 88, 87, 95, 89, 90, 93, 92, 91, 96, 124, 97, , ...]

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

cc @awaelchli

m-harmonic, Nov 30 '23 18:11

Same error here.

Xnhyacinth, Dec 13 '23 10:12

Same error here.

tuyaao, Jan 04 '24 07:01