DeepSpeed
[BUG] RuntimeError: output tensor must have the same type as input tensor
Here is my config:
{
"bf_16": {
"enable": true
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto",
"total_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"reduce_bucket_size": 2e8,
"stage3_prefetch_bucket_size": 50000000,
"stage3_param_persistence_threshold": 100000,
"sub_group_size": 1e9,
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": "auto"
},
"gradient_accumulation_steps": 1,
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
I use Hugging Face Accelerate with 8 × A100 GPUs, and the model is bigcode/starcoder. The error message is:
File "/data/home/miniconda3/envs/torch-2.0/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/data/home/miniconda3/envs/torch-2.0/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 313, in allgather_fn
return all_gather_into_tensor(output_tensor, input_tensor, group=group, async_op=async_op, debug=debug)
File "/data/home/miniconda3/envs/torch-2.0/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 120, in log_wrapper
return func(*args, **kwargs)
File "/data/home/miniconda3/envs/torch-2.0/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 298, in all_gather_into_tensor
return cdb.all_gather_into_tensor(output_tensor=output_tensor, input_tensor=tensor, group=group, async_op=async_op)
File "/data/home/miniconda3/envs/torch-2.0/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 125, in all_gather_into_tensor
return self.all_gather_function(output_tensor=output_tensor,
File "/data/home/miniconda3/envs/torch-2.0/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1436, in wrapper
return func(*args, **kwargs)
File "/data/home/miniconda3/envs/torch-2.0/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2517, in all_gather_into_tensor
work = group._allgather_base(output_tensor, input_tensor)
RuntimeError: output tensor must have the same type as input tensor
I printed the dtypes of input_tensor and output_tensor by injecting a print at https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/comm/comm.py#L312: the dtype of output_tensor is torch.float64 and the dtype of input_tensor is torch.float32.
If I use a single GPU, this error does not occur, but I run out of GPU memory instead.
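For context, a minimal standalone sketch (not from the original report; all values are illustrative) that triggers the same RuntimeError by calling torch.distributed.all_gather_into_tensor with mismatched dtypes. It assumes a node with at least 2 CUDA GPUs and the NCCL backend:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    # Hypothetical single-node rendezvous settings for the sketch.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    input_tensor = torch.ones(4, dtype=torch.float32, device="cuda")
    # Output buffer deliberately allocated in a different dtype (float64),
    # mirroring the mismatch reported above.
    output_tensor = torch.empty(4 * world_size, dtype=torch.float64, device="cuda")

    # Expected to raise:
    # RuntimeError: output tensor must have the same type as input tensor
    dist.all_gather_into_tensor(output_tensor, input_tensor)

    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```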
@DogeWatch the format for enabling bf16 in your config is incorrect; it should be:
"bf16": { "enabled": true },
@jomayeri sorry, that was my mistake, but with "bf16" it still has this error.
I have the same error, any workaround?
I resolved the issue by changing the Trainer argument from --bf16 to --fp16. I'm using PyTorch 2.0 together with DeepSpeed, and I've only come across this problem with that combination. It shows up as mismatched data types for input_tensor and output_tensor in DeepSpeed's runtime/zero/partition_parameters.py function _dist_allgather_fn: one was torch.float32 and the other torch.float16. I'm unsure whether PyTorch 2.0 is the root cause. By any chance, are you also using PyTorch 2.0?
Here's the related piece of code:
def _dist_allgather_fn(input_tensor: Tensor, output_tensor: Tensor, group=None):
    print(input_tensor.dtype, output_tensor.dtype)
    return instrument_w_nvtx(dist.allgather_fn)(output_tensor, input_tensor, group=group, async_op=True)
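A non-invasive alternative to editing the installed sources (a sketch of mine, assuming deepspeed.comm.allgather_fn keeps the (output_tensor, input_tensor, ...) ordering shown in the traceback above) is to wrap the function at runtime before training starts and log only the mismatching calls:

```python
import deepspeed.comm as dist_comm

_orig_allgather_fn = dist_comm.allgather_fn


def logged_allgather_fn(output_tensor, input_tensor, *args, **kwargs):
    # Log only the calls where the gather buffer dtype disagrees with the input.
    if output_tensor.dtype != input_tensor.dtype:
        print(f"[allgather dtype mismatch] output={output_tensor.dtype} input={input_tensor.dtype}")
    return _orig_allgather_fn(output_tensor, input_tensor, *args, **kwargs)


# Apply the patch before engine initialization so ZeRO-3 all-gathers go through it.
dist_comm.allgather_fn = logged_allgather_fn
```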
@ShadowTeamCN or @DogeWatch do you have a script I can use to reproduce this error with torch 2.0?
Switching to --fp16 works in my environment:
- torch: 1.13.1
- GPU: 2*8 A800-80G
- deepspeed: 0.10.0
- stage3 + zero++
- deepspeed config
{
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"fp16": {"enabled": true},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"last_batch_iteration": -1,
"total_num_steps": "auto",
"warmup_min_lr": 5e-7,
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"offload_param": {"device": "cpu", "pin_memory": true},
"offload_optimizer": {"device": "cpu", "pin_memory": true},
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"contiguous_gradients": true,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 5e8,
"zero_quantized_weights": true,
"zero_hpz_partition_size": 8,
"zero_quantized_gradients": true
}
}
Update: --bf16 works after changing zero_quantized_weights to false.
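For anyone applying the same workaround, here is a small sketch (assuming the config lives in a hypothetical ds_config.json) that disables ZeRO++ weight quantization and enables bf16 while leaving the rest of the config untouched:

```python
import json

# Load the existing DeepSpeed config (file name is an assumption for this sketch).
with open("ds_config.json") as f:
    ds_config = json.load(f)

# bf16 + quantized weight all-gather was the combination that triggered the
# dtype mismatch in this thread; turning quantization off avoids it.
ds_config["zero_optimization"]["zero_quantized_weights"] = False
ds_config["bf16"] = {"enabled": True}
ds_config.pop("fp16", None)  # make sure only one mixed-precision mode is set

with open("ds_config_bf16.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```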
Appears resolved with the correct flag.
Same issue here! --bf16 yields the error while --fp16 does not.
Has the problem been resolved?
After changing the config following @Sanster's suggestion, it works well with the llama-7b model using bf16, but it doesn't work with mistralai/Mixtral-8x7B-Instruct-v0.1. Any solution?
I have the same issue with QLoRA finetuning for Mixtral. Using @Sanster's config, it runs for a while and then all GPUs hang at 100% memory usage without making progress.
#5049