DeepSpeed
[BUG] RuntimeError: output tensor must have the same type as input tensor
Here is my config:
{
"bf_16": {
"enable": true
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto",
"total_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"reduce_bucket_size": 2e8,
"stage3_prefetch_bucket_size": 50000000,
"stage3_param_persistence_threshold": 100000,
"sub_group_size": 1e9,
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": "auto"
},
"gradient_accumulation_steps": 1,
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
I use Hugging Face Accelerate with 8 × A100 GPUs, and the model is bigcode/starcoder. The error message is:
File "/data/home/miniconda3/envs/torch-2.0/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/data/home/miniconda3/envs/torch-2.0/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 313, in allgather_fn
return all_gather_into_tensor(output_tensor, input_tensor, group=group, async_op=async_op, debug=debug)
File "/data/home/miniconda3/envs/torch-2.0/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 120, in log_wrapper
return func(*args, **kwargs)
File "/data/home/miniconda3/envs/torch-2.0/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 298, in all_gather_into_tensor
return cdb.all_gather_into_tensor(output_tensor=output_tensor, input_tensor=tensor, group=group, async_op=async_op)
File "/data/home/miniconda3/envs/torch-2.0/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 125, in all_gather_into_tensor
return self.all_gather_function(output_tensor=output_tensor,
File "/data/home/miniconda3/envs/torch-2.0/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1436, in wrapper
return func(*args, **kwargs)
File "/data/home/miniconda3/envs/torch-2.0/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2517, in all_gather_into_tensor
work = group._allgather_base(output_tensor, input_tensor)
RuntimeError: output tensor must have the same type as input tensor
I printed the dtypes of input_tensor and output_tensor by injecting a print at https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/comm/comm.py#L312: the dtype of output_tensor is torch.float64 and the dtype of input_tensor is torch.float32.
If I use a single GPU, this error does not occur, but I run out of GPU memory instead.
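For context, a minimal standalone sketch (not from the original report; all values are illustrative) that triggers the same RuntimeError by calling torch.distributed.all_gather_into_tensor with mismatched dtypes. It assumes a node with at least 2 CUDA GPUs and the NCCL backend:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    # Hypothetical single-node rendezvous settings for the sketch.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    input_tensor = torch.ones(4, dtype=torch.float32, device="cuda")
    # Output buffer deliberately allocated in a different dtype (float64),
    # mirroring the mismatch reported above.
    output_tensor = torch.empty(4 * world_size, dtype=torch.float64, device="cuda")

    # Expected to raise:
    # RuntimeError: output tensor must have the same type as input tensor
    dist.all_gather_into_tensor(output_tensor, input_tensor)

    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```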
@DogeWatch the format for enabling bf16 in your config is incorrect; it should be:
"bf16": { "enabled": true },
@jomayeri sorry, that was my mistake, but with "bf16" it still has this error.
I have the same error, any workaround?
I resolved the issue by changing the Trainer argument from --bf16 to --fp16. I'm using PyTorch 2.0 together with DeepSpeed, and I've only come across this problem with that combination. It shows up as mismatched data types for input_tensor and output_tensor in DeepSpeed's runtime/zero/partition_parameters.py function _dist_allgather_fn: one was torch.float32 and the other torch.float16. I'm unsure whether PyTorch 2.0 is the root cause. By any chance, are you also using PyTorch 2.0?
Here's the related piece of code:
def _dist_allgather_fn(input_tensor: Tensor, output_tensor: Tensor, group=None):
    print(input_tensor.dtype, output_tensor.dtype)
    return instrument_w_nvtx(dist.allgather_fn)(output_tensor, input_tensor, group=group, async_op=True)
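A non-invasive alternative to editing the installed sources (a sketch of mine, assuming deepspeed.comm.allgather_fn keeps the (output_tensor, input_tensor, ...) ordering shown in the traceback above) is to wrap the function at runtime before training starts and log only the mismatching calls:

```python
import deepspeed.comm as dist_comm

_orig_allgather_fn = dist_comm.allgather_fn


def logged_allgather_fn(output_tensor, input_tensor, *args, **kwargs):
    # Log only the calls where the gather buffer dtype disagrees with the input.
    if output_tensor.dtype != input_tensor.dtype:
        print(f"[allgather dtype mismatch] output={output_tensor.dtype} input={input_tensor.dtype}")
    return _orig_allgather_fn(output_tensor, input_tensor, *args, **kwargs)


# Apply the patch before engine initialization so ZeRO-3 all-gathers go through it.
dist_comm.allgather_fn = logged_allgather_fn
```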
@ShadowTeamCN or @DogeWatch do you have a script I can use to reproduce this error with torch 2.0?
Switching to --fp16 works in my environment:
- torch: 1.13.1
- GPU: 2*8 A800-80G
- deepspeed: 0.10.0
- stage3 + zero++
- deepspeed config
{
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"fp16": {"enabled": true},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"last_batch_iteration": -1,
"total_num_steps": "auto",
"warmup_min_lr": 5e-7,
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"offload_param": {"device": "cpu", "pin_memory": true},
"offload_optimizer": {"device": "cpu", "pin_memory": true},
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"contiguous_gradients": true,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 5e8,
"zero_quantized_weights": true,
"zero_hpz_partition_size": 8,
"zero_quantized_gradients": true
}
}
Update: --bf16 works after changing zero_quantized_weights to false.
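For anyone applying the same workaround, here is a small sketch (assuming the config lives in a hypothetical ds_config.json) that disables ZeRO++ weight quantization and enables bf16 while leaving the rest of the config untouched:

```python
import json

# Load the existing DeepSpeed config (file name is an assumption for this sketch).
with open("ds_config.json") as f:
    ds_config = json.load(f)

# bf16 + quantized weight all-gather was the combination that triggered the
# dtype mismatch in this thread; turning quantization off avoids it.
ds_config["zero_optimization"]["zero_quantized_weights"] = False
ds_config["bf16"] = {"enabled": True}
ds_config.pop("fp16", None)  # make sure only one mixed-precision mode is set

with open("ds_config_bf16.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```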
Appears resolved with the correct flag.
Same issue here! --bf16 yields the error while --fp16 does not.
Has the problem been resolved?
After changing the config following @Sanster's suggestion, it works well with the llama-7b model using bf16, but it doesn't work with mistralai/Mixtral-8x7B-Instruct-v0.1. Any solution?
I have the same issue with QLoRA finetuning for Mixtral. Using @Sanster's config, it runs for a while and then all GPUs hang at 100% memory usage without making progress.
#5049