
[BUG] ZeRO Stage2 and 3, error while loss backward

Open lkm2835 opened this issue 2 years ago • 13 comments

In ZeRO stage 1, training works.

I used from_pretrained("facebook/bart-base") as the backbone with transformers==4.2.1.

In ZeRO stage 2, the backward pass stops making progress, as if some processes were stuck in an infinite loop.

In ZeRO stage 3, I get the following:

[2022-05-18 14:33:12,828] [WARNING] [stage3.py:106:_apply_to_tensors_only] A module has unknown inputs or outputs type (<class 'torch.device'>) and the tensors embedded in it cannot be detected. The ZeRO-3 hooks designed to trigger before or after backward pass of the module relies on knowing the input and output tensors and therefore may not get triggered properly.
[2022-05-18 14:33:12,828] [WARNING] [stage3.py:106:_apply_to_tensors_only] A module has unknown inputs or outputs type (<class 'torch.device'>) and the tensors embedded in it cannot be detected. The ZeRO-3 hooks designed to trigger before or after backward pass of the module relies on knowing the input and output tensors and therefore may not get triggered properly.

    self.model.backward(loss)
  File "/home/kyungmin.lee/DeepSpeed/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/home/kyungmin.lee/DeepSpeed/deepspeed/runtime/engine.py", line 1726, in backward
    self.optimizer.backward(loss)
  File "/home/kyungmin.lee/DeepSpeed/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/home/kyungmin.lee/DeepSpeed/deepspeed/runtime/zero/stage3.py", line 2538, in backward
    self._get_param_coordinator(training=True).reset_step()
  File "/home/kyungmin.lee/DeepSpeed/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 194, in reset_step
    assert_ints_same_as_other_ranks([m.id for m in self.__submodule_order])
  File "/home/kyungmin.lee/DeepSpeed/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/home/kyungmin.lee/DeepSpeed/deepspeed/runtime/zero/utils.py", line 86, in assert_ints_same_as_other_ranks
    raise RuntimeError(f"disagreement between rank0 and rank{dist.get_rank()}: "
RuntimeError: disagreement between rank0 and rank1: rank0: [0, 1, 3, 103, 103, 103, 103, 5, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 76, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 97, 98, 99, 100, 101, 79, 80, 81, 82, 83, 97, 98, 99, 100, 101, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 97, 98, 99, 100, 101, 79, 80, 81, 82, 83, 97, 98, 99, 100, 101, 97, 98, 99, 100, 101, 97, 98, 99, 100, 101, 85, 86, 87, 88, 89, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 97, 98, 99, 100, 101, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 97, 98, 99, 100, 101, 97, 98, 99, 100, 101, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 97, 98, 99, 100, 101, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 97, 98, 99, 100, 101, 85, 86, 87, 88, 89, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 97, 98, 99, 100, 101, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 97, 98, 99, 100, 101, 97, 98, 99, 100, 101, 97, 98, 99, 100, 101, 97, 98, 99, 100, 101, 79, 80, 81, 82, 83, 68, 69, 70, 71, 73, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 67, 7, 8, 11, 9, 10, 12, 13, 14, 15, 16, 17, 18, 21, 19, 20, 22, 23, 24, 25, 26, 27, 28, 31, 29, 30, 32, 33, 34, 35, 36, 37, 38, 41, 39, 40, 42, 43, 44, 45, 46, 47, 48, 51, 49, 50, 52, 53, 54, 55, 56, 57, 58, 61, 59, 60, 62, 63, 64, 65, 66, 102, 103, 103, 103, 103, 104, 202, 106, 107, 110, 108, 109, 111, 112, 113, 116, 114, 115, 117, 118, 119, 120, 121, 122, 123, 126, 124, 125, 127, 128, 129, 132, 130, 131, 133, 134, 135, 136, 137, 138, 139, 142, 140, 141, 143, 144, 145, 148, 146, 147, 149, 150, 151, 152, 153, 154, 155, 158, 156, 157, 159, 160, 161, 164, 162, 163, 165, 166, 167, 168, 169, 170, 171, 174, 172, 173, 175, 176, 177, 180, 178, 179, 181, 182, 183, 184, 185, 186, 187, 190, 188, 189, 191, 192, 193, 196, 194, 195, 197, 198, 199, 200, 201, 203, 0, 203, 1, 102, 186, 187, 190, 188, 189, 191, 192, 193, 196, 194, 195, 197, 198, 199, 200, 201, 186, 201, 200, 199, 198, 193, 197, 195, 194, 196, 192, 187, 191, 189, 188, 190, 170, 171, 174, 172, 173, 175, 176, 177, 180, 178, 179, 181, 182, 183, 184, 185, 170, 185, 184, 183, 182, 177, 181, 179, 178, 180, 176, 171, 175, 173, 172, 174, 154, 155, 158, 156, 157, 159, 160, 161, 164, 162, 163, 165, 166, 167, 168, 169, 154, 169, 168, 167, 166, 161, 165, 163, 162, 164, 160, 155, 159, 157, 156, 158, 138, 139, 142, 140, 141, 143, 144, 145, 148, 146, 147, 149, 150, 151, 152, 153, 138, 153, 152, 151, 150, 145, 149, 147, 146, 148, 144, 139, 143, 141, 140, 142, 122, 123, 126, 124, 125, 127, 128, 129, 132, 130, 131, 133, 134, 135, 136, 137, 122, 137, 136, 135, 134, 129, 133, 131, 130, 132, 128, 123, 127, 125, 124, 126, 106, 107, 110, 108, 109, 111, 112, 113, 116, 114, 115, 117, 118, 119, 120, 121, 106, 121, 120, 119, 118, 113, 117, 115, 114, 3, 57, 58, 61, 59, 60, 62, 63, 64, 65, 66, 57, 66, 65, 64, 63, 58, 62, 60, 59, 61, 47, 48, 51, 49, 50, 52, 53, 54, 55, 56, 47, 56, 55, 54, 53, 48, 52, 50, 49, 51, 37, 38, 41, 39, 40, 42, 43, 44, 45, 46, 37, 46, 45, 44, 43, 38, 42, 40, 39, 41, 27, 28, 31, 29, 30, 32, 33, 34, 35, 36, 27, 36, 35, 34, 33, 
28, 32, 30, 29, 31, 17, 18, 21, 19, 20, 22, 23, 24, 25, 26, 17, 26, 25, 24, 23, 18, 22, 20, 19, 21, 7, 8, 11, 9, 10, 12, 13, 14, 15, 16, 7, 16, 15, 14, 13, 8, 12, 10, 9, 11, 67, 68, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 73, 69, 71, 70, 76, 79, 83, 82, 81, 80, 97, 101, 100, 99, 98, 97, 101, 100, 99, 98, 97, 101, 100, 99, 98, 97, 101, 100, 99, 98, 85, 89, 88, 87, 86, 79, 83, 82, 81, 80, 97, 101, 100, 99, 98, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 85, 89, 88, 87, 86, 97, 101, 100, 99, 98, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 97, 101, 100, 99, 98, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 97, 101, 100, 99, 98, 97, 101, 100, 99, 98, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 97, 101, 100, 99, 98, 85, 89, 88, 87, 86, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 85, 89, 88, 87, 86, 97, 101, 100, 99, 98, 97, 101, 100, 99, 98, 97, 101, 100, 99, 98, 79, 83, 82, 81, 80, 97, 101, 100, 99, 98, 85, 89, 88, 87, 86, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 97, 101, 100, 99, 98, 79, 83, 82, 81, 80, 97, 101, 100, 99, 98, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80], rank1: [0, 1, 3, 103, 103, 103, 103, 5, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 76, 97, 98, 99, 100, 101, 97, 98, 99, 100, 101, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 97, 98, 99, 100, 101, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 85, 86, 87, 88, 89, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 97, 98, 99, 100, 101, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 97, 98, 99, 100, 101, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 85, 86, 87, 88, 89, 85, 86, 87, 88, 89, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 97, 98, 99, 100, 101, 97, 98, 99, 100, 101, 97, 98, 99, 100, 101, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 85, 86, 87, 88, 89, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 97, 98, 99, 100, 101, 85, 86, 87, 88, 89, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 97, 98, 99, 100, 101, 68, 69, 70, 71, 73, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 67, 7, 8, 11, 9, 10, 12, 13, 14, 15, 16, 17, 18, 21, 19, 20, 22, 23, 24, 25, 26, 27, 28, 31, 29, 30, 32, 33, 34, 35, 36, 37, 38, 41, 39, 40, 42, 43, 44, 45, 46, 47, 48, 51, 49, 50, 52, 53, 54, 55, 56, 57, 58, 61, 59, 60, 62, 63, 64, 65, 66, 102, 103, 103, 103, 103, 104, 202, 106, 107, 110, 108, 109, 111, 112, 113, 116, 114, 115, 117, 118, 119, 120, 121, 122, 123, 126, 124, 125, 127, 128, 129, 132, 130, 131, 133, 134, 135, 136, 137, 138, 139, 142, 140, 141, 143, 144, 145, 148, 146, 147, 149, 150, 151, 152, 153, 154, 155, 158, 156, 157, 159, 160, 161, 164, 162, 163, 165, 166, 167, 168, 169, 170, 171, 174, 172, 173, 175, 176, 177, 
180, 178, 179, 181, 182, 183, 184, 185, 186, 187, 190, 188, 189, 191, 192, 193, 196, 194, 195, 197, 198, 199, 200, 201, 203, 0, 203, 1, 102, 186, 187, 190, 188, 189, 191, 192, 193, 196, 194, 195, 197, 198, 199, 200, 201, 186, 201, 200, 199, 198, 193, 197, 195, 194, 196, 192, 187, 191, 189, 188, 190, 170, 171, 174, 172, 173, 175, 176, 177, 180, 178, 179, 181, 182, 183, 184, 185, 170, 185, 184, 183, 182, 177, 181, 179, 178, 180, 176, 171, 175, 173, 172, 174, 154, 155, 158, 156, 157, 159, 160, 161, 164, 162, 163, 165, 166, 167, 168, 169, 154, 169, 168, 167, 166, 161, 165, 163, 162, 164, 160, 155, 159, 157, 156, 158, 138, 139, 142, 140, 141, 143, 144, 145, 148, 146, 147, 149, 150, 151, 152, 153, 138, 153, 152, 151, 150, 145, 149, 147, 146, 148, 144, 139, 143, 141, 140, 142, 122, 123, 126, 124, 125, 127, 128, 129, 132, 130, 131, 133, 134, 135, 136, 137, 122, 137, 136, 135, 134, 129, 133, 131, 130, 132, 128, 123, 127, 125, 124, 126, 106, 107, 110, 108, 109, 111, 112, 113, 116, 114, 115, 117, 118, 119, 120, 121, 106, 121, 120, 119, 118, 113, 117, 115, 114, 3, 57, 58, 61, 59, 60, 62, 63, 64, 65, 66, 57, 66, 65, 64, 63, 58, 62, 60, 59, 61, 47, 48, 51, 49, 50, 52, 53, 54, 55, 56, 47, 56, 55, 54, 53, 48, 52, 50, 49, 51, 37, 38, 41, 39, 40, 42, 43, 44, 45, 46, 37, 46, 45, 44, 43, 38, 42, 40, 39, 41, 27, 28, 31, 29, 30, 32, 33, 34, 35, 36, 27, 36, 35, 34, 33, 28, 32, 30, 29, 31, 17, 18, 21, 19, 20, 22, 23, 24, 25, 26, 17, 26, 25, 24, 23, 18, 22, 20, 19, 21, 7, 8, 11, 9, 10, 12, 13, 14, 15, 16, 7, 16, 15, 14, 13, 8, 12, 10, 9, 11, 67, 68, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 73, 69, 71, 70, 76, 97, 101, 100, 99, 98, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 85, 89, 88, 87, 86, 97, 101, 100, 99, 98, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 85, 89, 88, 87, 86, 85, 89, 88, 87, 86, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 97, 101, 100, 99, 98, 97, 101, 100, 99, 98, 97, 101, 100, 99, 98, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 85, 89, 88, 87, 86, 85, 89, 88, 87, 86, 85, 89, 88, 87, 86, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 97, 101, 100, 99, 98, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 97, 101, 100, 99, 98, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 85, 89, 88, 87, 86, 85, 89, 88, 87, 86, 79, 83, 82, 81, 80, 97, 101, 100, 99, 98, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 97, 101, 100, 99, 98, 97, 101, 100, 99, 98]

Stage2 config

{
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true
  },
"gradient_accumulation_steps": 1,
"gradient_clipping": 1.0,
"train_batch_size": 100,
"train_micro_batch_size_per_gpu": 50,
"steps_per_print":10000
}

Stage3 config

{
  "zero_optimization": {
    "stage": 3,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true,
    "stage3_max_live_parameters" : 1e9,
    "stage3_max_reuse_distance" : 1e9,
    "stage3_prefetch_bucket_size" : 5e8,
    "stage3_param_persistence_threshold" : 1e6,
    "sub_group_size" : 1e12,
    "elastic_checkpoint" : [true],
    "stage3_gather_16bit_weights_on_model_save": [false],
    "ignore_unused_parameters": [true],
    "round_robin_gradients": [false]
  },
"gradient_accumulation_steps": 1,
"gradient_clipping": 1.0,
"train_batch_size": 100,
"train_micro_batch_size_per_gpu": 50,
"steps_per_print":10000
}

System info (please complete the following information):

  • OS: Ubuntu 20.04
  • GPUs: 2 A100s used (one machine with 16 A100s)
  • Python: 3.8.13
  • DeepSpeed: 0.6.5+5053217e
  • Transformers: 4.2.1

lkm2835 avatar May 18 '22 05:05 lkm2835

Have you figured it out? I got the same error.

leao1995 avatar Feb 20 '23 19:02 leao1995

I got the same error.

fengpeng-yue avatar Jul 27 '23 10:07 fengpeng-yue

I got the same error, any update on this issue?

dnaihao avatar Aug 11 '23 20:08 dnaihao

I'm getting a similar error running Llama 2 7B on 4 L4 GPUs in stage 3.

deepspeed:
    train_micro_batch_size_per_gpu: 4096
    eval_micro_batch_size_per_gpu: 2048
    prescale_gradients: false
    bf16:
      enabled: true
    gradient_clipping: 10.0
    optimizer:
      type: "Adam"
      params:
        lr: 1.0e-5  # Larger LR due to LoRA
        betas:
          - 0.8
          - 0.999
        eps: 1.0e-8
        weight_decay: 3.0e-7
    scheduler:
      type: "WarmupLR"
      params:
        warmup_min_lr: 1.0e-6
        warmup_max_lr: 1.0e-5
        warmup_num_steps: 50
    zero_optimization:
      stage: 3
      allgather_partitions: true
      allgather_bucket_size: 500000000
      overlap_comm: false
      reduce_scatter: true
      reduce_bucket_size: 500000000
      contiguous_gradients: true

In my case, one rank seems to report very large negative or positive numbers instead of small module ids:

RuntimeError: disagreement between rank0 and rank2: rank0: [-4866073977605047075, -4957833660337767236, 4361942804416314505, -4876770194190910351, 4359269593160498317, 4351670099501268146, 4309166695103937713, -4902805588004029293, -4860302615267493106, 4389949427156860023, 4327322237000694935, -4847777101216203705, 4338298773231877268, 4348574010025524079, -4883946058044031946, 4362928190187388154, 4294669786782055563, -4855517781217919899, 4317751329683684451, ... 4237109328703306542, 4339425420460244029, -4915050039391372737, 4348854430596315870, 4333655295082511627, 4265537715097910320, 4356172371969948135], 
rank2: [2, 3, 4, 6, 34, 7, 8, 10, 12, 14, 17, 18, 20, 22, 24, 28, 27, 35, 29, 30, 33, 32, 31, 36, 64, 37, 38, 40, 42, 44, 47, 48, 50, 52, 54, 58, 57, 65, 59, 60, 63, 62, 61, 66, 94, 67, 68, 70, 72, 74, 77, 78, 80, 82, 84, 88, 87, 95, 89, 90, 93, 92, 91, 96, 124, 97, , ...]

LeSphax avatar Aug 11 '23 20:08 LeSphax

Thanks a lot for sharing @LeSphax! I am still in the middle of debugging myself, and my output is similar to yours. I hit the error at the start of validation (I am using the latest version of PyTorch Lightning, by the way). I have traced it down, and for the DeepSpeed version I am using (0.10.x) it is raised by

if not self.is_complete_trace():  # not self.trace_complete:
    # Make sure that recorded submodule orders are identical across ranks
    assert_ints_same_as_other_ranks([m.id for m in self.__submodule_order])

which is at line 200 of partitioned_param_coordinator.py under runtime/zero in the DeepSpeed library. I am now running on trial data to check whether the issue is in my own code. I will update you if I manage to fix it somehow...
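
For context, the check that raises this error conceptually just gathers each rank's recorded list of submodule ids and compares it against rank 0's list. Below is a rough, self-contained sketch of that idea (an illustration only, not DeepSpeed's actual implementation; the function name is made up):

import torch.distributed as dist

def check_ints_same_as_rank0(ints, tag="submodule order"):
    # Gather the list from every rank, then compare this rank's list to rank 0's.
    # Assumes the default process group has already been initialized.
    world_size = dist.get_world_size()
    gathered = [None] * world_size
    dist.all_gather_object(gathered, list(ints))
    rank = dist.get_rank()
    if gathered[rank] != gathered[0]:
        raise RuntimeError(
            f"{tag} disagreement between rank0 and rank{rank}: "
            f"rank0={gathered[0]}, rank{rank}={gathered[rank]}"
        )

So the assertion fires whenever one rank's forward/backward trace visits a different set or order of modules than rank 0's, which is why rank-dependent control flow in the model, loss, or data pipeline can trigger it.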

dnaihao avatar Aug 11 '23 21:08 dnaihao

I also encountered the issue and finally fixed it. In my original code, I had an operation like

L = 0.0
for ...:
    L = L + a_torch_tensor
L = L / b # b is a float or int

where L was initialized as a Python float but became a torch tensor during the loop. After modifying the code to:

L_list = []
for ...:
    L_list.append(a_torch_tensor)
L = torch.stack(L_list, dim=0).sum(0) / b

the error disappeared!
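
To make the difference concrete, here is a minimal, self-contained sketch of the two patterns (the per-item losses and divisor below are placeholders, not from my original code):

import torch

per_item_losses = [torch.tensor(0.5), torch.tensor(1.5), torch.tensor(2.0)]  # placeholder values
b = len(per_item_losses)

# Original pattern: L starts as a Python float and is rebound to a tensor inside the loop.
L = 0.0
for item_loss in per_item_losses:
    L = L + item_loss
L = L / b

# Reworked pattern: collect the tensors and reduce them with a single tensor op.
L_fixed = torch.stack(per_item_losses, dim=0).sum(0) / b

print(L, L_fixed)  # same value in this toy example

Both variants compute the same number here; my experience above only suggests the first pattern can interact badly with ZeRO-3's module tracing, presumably because the mixed float/tensor accumulation changes how the backward graph, and with it the recorded submodule order, looks across ranks.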

ccx1997 avatar Nov 15 '23 09:11 ccx1997

Hello, I am seeing the same error as others mentioned. I am using deepspeed_stage_3 with PyTorch Lightning, and all deepspeed settings are set to defaults:

trainer = lightning.Trainer(
  strategy = "deepspeed_stage_3",
  precision = "bf16-mixed",
  devices = 8,
  num_nodes = 1,
)

Can someone suggest a workaround? Thank you

The specific error message is similar to the one @LeSphax added above:

RuntimeError: disagreement between rank0 and rank2: rank0: [-4866073977605047075, -4957833660337767236, 4361942804416314505, -4876770194190910351, 4359269593160498317, 4351670099501268146, 4309166695103937713, -4902805588004029293, -4860302615267493106, 4389949427156860023, 4327322237000694935, -4847777101216203705, 4338298773231877268, 4348574010025524079, -4883946058044031946, 4362928190187388154, 4294669786782055563, -4855517781217919899, 4317751329683684451, ... 4237109328703306542, 4339425420460244029, -4915050039391372737, 4348854430596315870, 4333655295082511627, 4265537715097910320, 4356172371969948135], 
rank2: [2, 3, 4, 6, 34, 7, 8, 10, 12, 14, 17, 18, 20, 22, 24, 28, 27, 35, 29, 30, 33, 32, 31, 36, 64, 37, 38, 40, 42, 44, 47, 48, 50, 52, 54, 58, 57, 65, 59, 60, 63, 62, 61, 66, 94, 67, 68, 70, 72, 74, 77, 78, 80, 82, 84, 88, 87, 95, 89, 90, 93, 92, 91, 96, 124, 97, , ...]

m-harmonic avatar Nov 29 '23 23:11 m-harmonic

(quotes @m-harmonic's comment above in full)

I get the same error, and only stage 3 triggers it.

Xnhyacinth avatar Dec 13 '23 10:12 Xnhyacinth

same error

tuyaao avatar Jan 04 '24 07:01 tuyaao

(quotes the original issue report in full)

Hi, thanks @lkm2835 for raising the issue. Could you provide a simple Python script so we can easily reproduce your error? Thanks in advance.

GuanhuaWang avatar Jan 16 '24 06:01 GuanhuaWang

same error

liuqi8827 avatar Mar 28 '24 14:03 liuqi8827

same error

whu-dft avatar Jul 22 '24 09:07 whu-dft

(quotes @dnaihao's comment above in full)

I hit exactly the same error during evaluation. Have you solved it?

yiyepiaoling0715 avatar Jul 25 '24 02:07 yiyepiaoling0715