[BUG] ZeRO Stage2 and 3, error while loss backward
In ZeRO stage 1, training works.
I used from_pretrained("facebook/bart-base") as the backbone with transformers==4.2.1.
In ZeRO stage 2, backward stops making progress on some processes, as if stuck in an infinite loop.
In ZeRO stage 3, I get the following warnings and traceback:
[2022-05-18 14:33:12,828] [WARNING] [stage3.py:106:_apply_to_tensors_only] A module has unknown inputs or outputs type (<class 'torch.device'>) and the tensors embedded in it cannot be detected. The ZeRO-3 hooks designed to trigger before or after backward pass of the module relies on knowing the input and output tensors and therefore may not get triggered properly.
[2022-05-18 14:33:12,828] [WARNING] [stage3.py:106:_apply_to_tensors_only] A module has unknown inputs or outputs type (<class 'torch.device'>) and the tensors embedded in it cannot be detected. The ZeRO-3 hooks designed to trigger before or after backward pass of the module relies on knowing the input and output tensors and therefore may not get triggered properly.
self.model.backward(loss)
File "/home/kyungmin.lee/DeepSpeed/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/home/kyungmin.lee/DeepSpeed/deepspeed/runtime/engine.py", line 1726, in backward
self.optimizer.backward(loss)
File "/home/kyungmin.lee/DeepSpeed/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/home/kyungmin.lee/DeepSpeed/deepspeed/runtime/zero/stage3.py", line 2538, in backward
self._get_param_coordinator(training=True).reset_step()
File "/home/kyungmin.lee/DeepSpeed/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 194, in reset_step
assert_ints_same_as_other_ranks([m.id for m in self.__submodule_order])
File "/home/kyungmin.lee/DeepSpeed/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/home/kyungmin.lee/DeepSpeed/deepspeed/runtime/zero/utils.py", line 86, in assert_ints_same_as_other_ranks
raise RuntimeError(f"disagreement between rank0 and rank{dist.get_rank()}: "
RuntimeError: disagreement between rank0 and rank1: rank0: [0, 1, 3, 103, 103, 103, 103, 5, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 76, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 97, 98, 99, 100, 101, 79, 80, 81, 82, 83, 97, 98, 99, 100, 101, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 97, 98, 99, 100, 101, 79, 80, 81, 82, 83, 97, 98, 99, 100, 101, 97, 98, 99, 100, 101, 97, 98, 99, 100, 101, 85, 86, 87, 88, 89, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 97, 98, 99, 100, 101, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 97, 98, 99, 100, 101, 97, 98, 99, 100, 101, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 97, 98, 99, 100, 101, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 97, 98, 99, 100, 101, 85, 86, 87, 88, 89, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 97, 98, 99, 100, 101, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 97, 98, 99, 100, 101, 97, 98, 99, 100, 101, 97, 98, 99, 100, 101, 97, 98, 99, 100, 101, 79, 80, 81, 82, 83, 68, 69, 70, 71, 73, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 67, 7, 8, 11, 9, 10, 12, 13, 14, 15, 16, 17, 18, 21, 19, 20, 22, 23, 24, 25, 26, 27, 28, 31, 29, 30, 32, 33, 34, 35, 36, 37, 38, 41, 39, 40, 42, 43, 44, 45, 46, 47, 48, 51, 49, 50, 52, 53, 54, 55, 56, 57, 58, 61, 59, 60, 62, 63, 64, 65, 66, 102, 103, 103, 103, 103, 104, 202, 106, 107, 110, 108, 109, 111, 112, 113, 116, 114, 115, 117, 118, 119, 120, 121, 122, 123, 126, 124, 125, 127, 128, 129, 132, 130, 131, 133, 134, 135, 136, 137, 138, 139, 142, 140, 141, 143, 144, 145, 148, 146, 147, 149, 150, 151, 152, 153, 154, 155, 158, 156, 157, 159, 160, 161, 164, 162, 163, 165, 166, 167, 168, 169, 170, 171, 174, 172, 173, 175, 176, 177, 180, 178, 179, 181, 182, 183, 184, 185, 186, 187, 190, 188, 189, 191, 192, 193, 196, 194, 195, 197, 198, 199, 200, 201, 203, 0, 203, 1, 102, 186, 187, 190, 188, 189, 191, 192, 193, 196, 194, 195, 197, 198, 199, 200, 201, 186, 201, 200, 199, 198, 193, 197, 195, 194, 196, 192, 187, 191, 189, 188, 190, 170, 171, 174, 172, 173, 175, 176, 177, 180, 178, 179, 181, 182, 183, 184, 185, 170, 185, 184, 183, 182, 177, 181, 179, 178, 180, 176, 171, 175, 173, 172, 174, 154, 155, 158, 156, 157, 159, 160, 161, 164, 162, 163, 165, 166, 167, 168, 169, 154, 169, 168, 167, 166, 161, 165, 163, 162, 164, 160, 155, 159, 157, 156, 158, 138, 139, 142, 140, 141, 143, 144, 145, 148, 146, 147, 149, 150, 151, 152, 153, 138, 153, 152, 151, 150, 145, 149, 147, 146, 148, 144, 139, 143, 141, 140, 142, 122, 123, 126, 124, 125, 127, 128, 129, 132, 130, 131, 133, 134, 135, 136, 137, 122, 137, 136, 135, 134, 129, 133, 131, 130, 132, 128, 123, 127, 125, 124, 126, 106, 107, 110, 108, 109, 111, 112, 113, 116, 114, 115, 117, 118, 119, 120, 121, 106, 121, 120, 119, 118, 113, 117, 115, 114, 3, 57, 58, 61, 59, 60, 62, 63, 64, 65, 66, 57, 66, 65, 64, 63, 58, 62, 60, 59, 61, 47, 48, 51, 49, 50, 52, 53, 54, 55, 56, 47, 56, 55, 54, 53, 48, 52, 50, 49, 51, 37, 38, 41, 39, 40, 42, 43, 44, 45, 46, 37, 46, 45, 44, 43, 38, 42, 40, 39, 41, 27, 28, 31, 29, 30, 32, 33, 34, 35, 36, 27, 36, 35, 34, 33, 
28, 32, 30, 29, 31, 17, 18, 21, 19, 20, 22, 23, 24, 25, 26, 17, 26, 25, 24, 23, 18, 22, 20, 19, 21, 7, 8, 11, 9, 10, 12, 13, 14, 15, 16, 7, 16, 15, 14, 13, 8, 12, 10, 9, 11, 67, 68, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 73, 69, 71, 70, 76, 79, 83, 82, 81, 80, 97, 101, 100, 99, 98, 97, 101, 100, 99, 98, 97, 101, 100, 99, 98, 97, 101, 100, 99, 98, 85, 89, 88, 87, 86, 79, 83, 82, 81, 80, 97, 101, 100, 99, 98, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 85, 89, 88, 87, 86, 97, 101, 100, 99, 98, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 97, 101, 100, 99, 98, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 97, 101, 100, 99, 98, 97, 101, 100, 99, 98, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 97, 101, 100, 99, 98, 85, 89, 88, 87, 86, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 85, 89, 88, 87, 86, 97, 101, 100, 99, 98, 97, 101, 100, 99, 98, 97, 101, 100, 99, 98, 79, 83, 82, 81, 80, 97, 101, 100, 99, 98, 85, 89, 88, 87, 86, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 97, 101, 100, 99, 98, 79, 83, 82, 81, 80, 97, 101, 100, 99, 98, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80], rank1: [0, 1, 3, 103, 103, 103, 103, 5, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 75, 76, 97, 98, 99, 100, 101, 97, 98, 99, 100, 101, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 97, 98, 99, 100, 101, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 85, 86, 87, 88, 89, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 97, 98, 99, 100, 101, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 97, 98, 99, 100, 101, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 85, 86, 87, 88, 89, 85, 86, 87, 88, 89, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 97, 98, 99, 100, 101, 97, 98, 99, 100, 101, 97, 98, 99, 100, 101, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 85, 86, 87, 88, 89, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 97, 98, 99, 100, 101, 85, 86, 87, 88, 89, 85, 86, 87, 88, 89, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 79, 80, 81, 82, 83, 97, 98, 99, 100, 101, 68, 69, 70, 71, 73, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 67, 7, 8, 11, 9, 10, 12, 13, 14, 15, 16, 17, 18, 21, 19, 20, 22, 23, 24, 25, 26, 27, 28, 31, 29, 30, 32, 33, 34, 35, 36, 37, 38, 41, 39, 40, 42, 43, 44, 45, 46, 47, 48, 51, 49, 50, 52, 53, 54, 55, 56, 57, 58, 61, 59, 60, 62, 63, 64, 65, 66, 102, 103, 103, 103, 103, 104, 202, 106, 107, 110, 108, 109, 111, 112, 113, 116, 114, 115, 117, 118, 119, 120, 121, 122, 123, 126, 124, 125, 127, 128, 129, 132, 130, 131, 133, 134, 135, 136, 137, 138, 139, 142, 140, 141, 143, 144, 145, 148, 146, 147, 149, 150, 151, 152, 153, 154, 155, 158, 156, 157, 159, 160, 161, 164, 162, 163, 165, 166, 167, 168, 169, 170, 171, 174, 172, 173, 175, 176, 177, 
180, 178, 179, 181, 182, 183, 184, 185, 186, 187, 190, 188, 189, 191, 192, 193, 196, 194, 195, 197, 198, 199, 200, 201, 203, 0, 203, 1, 102, 186, 187, 190, 188, 189, 191, 192, 193, 196, 194, 195, 197, 198, 199, 200, 201, 186, 201, 200, 199, 198, 193, 197, 195, 194, 196, 192, 187, 191, 189, 188, 190, 170, 171, 174, 172, 173, 175, 176, 177, 180, 178, 179, 181, 182, 183, 184, 185, 170, 185, 184, 183, 182, 177, 181, 179, 178, 180, 176, 171, 175, 173, 172, 174, 154, 155, 158, 156, 157, 159, 160, 161, 164, 162, 163, 165, 166, 167, 168, 169, 154, 169, 168, 167, 166, 161, 165, 163, 162, 164, 160, 155, 159, 157, 156, 158, 138, 139, 142, 140, 141, 143, 144, 145, 148, 146, 147, 149, 150, 151, 152, 153, 138, 153, 152, 151, 150, 145, 149, 147, 146, 148, 144, 139, 143, 141, 140, 142, 122, 123, 126, 124, 125, 127, 128, 129, 132, 130, 131, 133, 134, 135, 136, 137, 122, 137, 136, 135, 134, 129, 133, 131, 130, 132, 128, 123, 127, 125, 124, 126, 106, 107, 110, 108, 109, 111, 112, 113, 116, 114, 115, 117, 118, 119, 120, 121, 106, 121, 120, 119, 118, 113, 117, 115, 114, 3, 57, 58, 61, 59, 60, 62, 63, 64, 65, 66, 57, 66, 65, 64, 63, 58, 62, 60, 59, 61, 47, 48, 51, 49, 50, 52, 53, 54, 55, 56, 47, 56, 55, 54, 53, 48, 52, 50, 49, 51, 37, 38, 41, 39, 40, 42, 43, 44, 45, 46, 37, 46, 45, 44, 43, 38, 42, 40, 39, 41, 27, 28, 31, 29, 30, 32, 33, 34, 35, 36, 27, 36, 35, 34, 33, 28, 32, 30, 29, 31, 17, 18, 21, 19, 20, 22, 23, 24, 25, 26, 17, 26, 25, 24, 23, 18, 22, 20, 19, 21, 7, 8, 11, 9, 10, 12, 13, 14, 15, 16, 7, 16, 15, 14, 13, 8, 12, 10, 9, 11, 67, 68, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 73, 69, 71, 70, 76, 97, 101, 100, 99, 98, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 85, 89, 88, 87, 86, 97, 101, 100, 99, 98, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 85, 89, 88, 87, 86, 85, 89, 88, 87, 86, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 97, 101, 100, 99, 98, 97, 101, 100, 99, 98, 97, 101, 100, 99, 98, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 85, 89, 88, 87, 86, 85, 89, 88, 87, 86, 85, 89, 88, 87, 86, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 79, 83, 82, 81, 80, 97, 101, 100, 99, 98, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 97, 101, 100, 99, 98, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 85, 89, 88, 87, 86, 85, 89, 88, 87, 86, 79, 83, 82, 81, 80, 97, 101, 100, 99, 98, 79, 83, 82, 81, 80, 85, 89, 88, 87, 86, 97, 101, 100, 99, 98, 97, 101, 100, 99, 98]
Stage2 config
{
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": 1,
  "gradient_clipping": 1.0,
  "train_batch_size": 100,
  "train_micro_batch_size_per_gpu": 50,
  "steps_per_print": 10000
}
Stage3 config
{
  "zero_optimization": {
    "stage": 3,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_prefetch_bucket_size": 5e8,
    "stage3_param_persistence_threshold": 1e6,
    "sub_group_size": 1e12,
    "elastic_checkpoint": [true],
    "stage3_gather_16bit_weights_on_model_save": [false],
    "ignore_unused_parameters": [true],
    "round_robin_gradients": [false]
  },
  "gradient_accumulation_steps": 1,
  "gradient_clipping": 1.0,
  "train_batch_size": 100,
  "train_micro_batch_size_per_gpu": 50,
  "steps_per_print": 10000
}
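For reference, here is a minimal sketch of how a Stage 3 config like the one above is typically driven with deepspeed.initialize. The config path, dummy loss, and batch are illustrative assumptions, not the actual training script; the script would be launched with the deepspeed launcher.

# Minimal repro-style sketch (assumed config path "ds_config_stage3.json",
# placeholder loss); not the original training code.
import torch
import deepspeed
from transformers import BartModel, BartTokenizer

model = BartModel.from_pretrained("facebook/bart-base")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config="ds_config_stage3.json",  # the Stage 3 JSON shown above
)

batch = tokenizer(["a short example sentence"], return_tensors="pt").to(engine.device)
outputs = engine(**batch)
loss = outputs.last_hidden_state.mean()  # placeholder loss just to drive backward
engine.backward(loss)  # the stage 2/3 failures reported here happen in this call
engine.step()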
System info:
- OS: Ubuntu 20.04
- GPUs: used 2x A100 (one machine with 16 A100s)
- Python: 3.8.13
- DeepSpeed: 0.6.5+5053217e
- Transformers: 4.2.1
Have you figured it out? I got the same error.
I got the same error.
I got the same error, any update on this issue?
I'm getting a similar error running Llama 2 7B on 4 L4 GPUs with stage 3:
deepspeed:
  train_micro_batch_size_per_gpu: 4096
  eval_micro_batch_size_per_gpu: 2048
  prescale_gradients: false
  bf16:
    enabled: true
  gradient_clipping: 10.0
  optimizer:
    type: "Adam"
    params:
      lr: 1.0e-5  # Larger LR due to LoRA
      betas:
        - 0.8
        - 0.999
      eps: 1.0e-8
      weight_decay: 3.0e-7
  scheduler:
    type: "WarmupLR"
    params:
      warmup_min_lr: 1.0e-6
      warmup_max_lr: 1.0e-5
      warmup_num_steps: 50
  zero_optimization:
    stage: 3
    allgather_partitions: true
    allgather_bucket_size: 500000000
    overlap_comm: false
    reduce_scatter: true
    reduce_bucket_size: 500000000
    contiguous_gradients: true
In my case, one rank seems to output very large negative or positive numbers:
RuntimeError: disagreement between rank0 and rank2: rank0: [-4866073977605047075, -4957833660337767236, 4361942804416314505, -4876770194190910351, 4359269593160498317, 4351670099501268146, 4309166695103937713, -4902805588004029293, -4860302615267493106, 4389949427156860023, 4327322237000694935, -4847777101216203705, 4338298773231877268, 4348574010025524079, -4883946058044031946, 4362928190187388154, 4294669786782055563, -4855517781217919899, 4317751329683684451, ... 4237109328703306542, 4339425420460244029, -4915050039391372737, 4348854430596315870, 4333655295082511627, 4265537715097910320, 4356172371969948135],
rank2: [2, 3, 4, 6, 34, 7, 8, 10, 12, 14, 17, 18, 20, 22, 24, 28, 27, 35, 29, 30, 33, 32, 31, 36, 64, 37, 38, 40, 42, 44, 47, 48, 50, 52, 54, 58, 57, 65, 59, 60, 63, 62, 61, 66, 94, 67, 68, 70, 72, 74, 77, 78, 80, 82, 84, 88, 87, 95, 89, 90, 93, 92, 91, 96, 124, 97, , ...]
Thanks a lot for sharing @LeSphax! I am still in the middle of debugging myself, and my output is similar to yours; I hit it at the start of validation (I am using the latest version of PyTorch Lightning, btw). I have traced the error down, and for the DeepSpeed version I am using (0.10.X) it is related to
if not self.is_complete_trace():  # not self.trace_complete:
    # Make sure that recorded submodule orders are identical across ranks
    assert_ints_same_as_other_ranks([m.id for m in self.__submodule_order])
which is at line 200 of partitioned_param_coordinator.py under runtime/zero in the DeepSpeed library. I am now testing with trial data to see whether the issue is in my own code. I will update you if I manage to fix it somehow...
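For intuition, the check compares each rank's recorded list of submodule ids against rank 0's. A conceptual sketch of such a cross-rank consistency check (not DeepSpeed's actual implementation, which lives in deepspeed/runtime/zero/utils.py; the function name here is made up) could look like:

import torch.distributed as dist

def check_ints_same_as_rank0(ints):
    """Gather each rank's list of ints and compare it against rank 0's list."""
    world_size = dist.get_world_size()
    gathered = [None] * world_size
    # Collect every rank's Python list (requires an initialized process group).
    dist.all_gather_object(gathered, ints)
    if ints != gathered[0]:
        raise RuntimeError(
            "disagreement between rank0 and rank%d: rank0: %s, rank%d: %s"
            % (dist.get_rank(), gathered[0], dist.get_rank(), ints)
        )

The error in this thread is exactly that kind of disagreement: the per-rank id lists differ, which typically points to data-dependent control flow taking different paths on different ranks during the traced step.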
I also encountered the issue and finally fixed it. In my original code, I had an operation like
L = 0.0
for ...:
    L = L + a_torch_tensor
L = L / b  # b is a float or int
where L was initialized as a Python float but became a torch tensor inside the loop. After modifying the code to:
L_list = []
for ...:
    L_list.append(a_torch_tensor)
L = torch.stack(L_list, dim=0).sum(0) / b
the error disappeared!
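To make the pattern above concrete, here is a small self-contained sketch of the before/after accumulation; the tensor values and the divisor are made up for illustration:

import torch

per_step_losses = [torch.randn(()) for _ in range(4)]  # stand-ins for a_torch_tensor
b = len(per_step_losses)

# Before: L starts life as a Python float and only becomes a tensor mid-loop.
L = 0.0
for t in per_step_losses:
    L = L + t
L = L / b

# After: accumulate into a list and reduce with torch.stack, so L is built
# the same way (and stays a tensor) on every rank.
L_list = []
for t in per_step_losses:
    L_list.append(t)
L_fixed = torch.stack(L_list, dim=0).sum(0) / b

assert torch.allclose(L, L_fixed)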
Hello, I am seeing the same error as others mentioned. I am using deepspeed_stage_3 with PyTorch Lightning, and all deepspeed settings are set to defaults:
trainer = lightning.Trainer(
    strategy="deepspeed_stage_3",
    precision="bf16-mixed",
    devices=8,
    num_nodes=1,
)
Can someone suggest a workaround? Thank you
The specific error message is similar to the one @LeSphax added above:
RuntimeError: disagreement between rank0 and rank2: rank0: [-4866073977605047075, -4957833660337767236, 4361942804416314505, -4876770194190910351, 4359269593160498317, 4351670099501268146, 4309166695103937713, -4902805588004029293, -4860302615267493106, 4389949427156860023, 4327322237000694935, -4847777101216203705, 4338298773231877268, 4348574010025524079, -4883946058044031946, 4362928190187388154, 4294669786782055563, -4855517781217919899, 4317751329683684451, ... 4237109328703306542, 4339425420460244029, -4915050039391372737, 4348854430596315870, 4333655295082511627, 4265537715097910320, 4356172371969948135],
rank2: [2, 3, 4, 6, 34, 7, 8, 10, 12, 14, 17, 18, 20, 22, 24, 28, 27, 35, 29, 30, 33, 32, 31, 36, 64, 37, 38, 40, 42, 44, 47, 48, 50, 52, 54, 58, 57, 65, 59, 60, 63, 62, 61, 66, 94, 67, 68, 70, 72, 74, 77, 78, 80, 82, 84, 88, 87, 95, 89, 90, 93, 92, 91, 96, 124, 97, , ...]
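For anyone who wants to experiment with non-default ZeRO-3 settings in Lightning while debugging this, a hypothetical sketch (not a confirmed workaround) is to pass an explicit DeepSpeed config through DeepSpeedStrategy instead of the "deepspeed_stage_3" alias; the config path below is just an example:

import lightning
from lightning.pytorch.strategies import DeepSpeedStrategy

trainer = lightning.Trainer(
    # "ds_config_stage3.json" is an example path; a Python dict also works.
    strategy=DeepSpeedStrategy(config="ds_config_stage3.json"),
    precision="bf16-mixed",
    devices=8,
    num_nodes=1,
)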
Same error here; only stage 3 triggers it.
same error
Hi, thanks @lkm2835 for raising the issue. Could you provide a simple Python script so we can easily reproduce your error? Thanks in advance.
same error
same error
Exactly the same error during evaluation. Has anyone solved it?