
[BUG] RuntimeError: output tensor must have the same type as input tensor

DogeWatch opened this issue 2 years ago · 2 comments

Here is my config:

{
    "bf_16": {
        "enable": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto",
            "total_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "reduce_bucket_size": 2e8,
        "stage3_prefetch_bucket_size": 50000000,
        "stage3_param_persistence_threshold": 100000,
        "sub_group_size": 1e9,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": "auto"
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

I use Hugging Face Accelerate with 8 * A100 GPUs, and the model is bigcode/starcoder. The error message is:

File "/data/home/miniconda3/envs/torch-2.0/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/data/home/miniconda3/envs/torch-2.0/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 313, in allgather_fn
    return all_gather_into_tensor(output_tensor, input_tensor, group=group, async_op=async_op, debug=debug)
  File "/data/home/miniconda3/envs/torch-2.0/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 120, in log_wrapper
    return func(*args, **kwargs)
  File "/data/home/miniconda3/envs/torch-2.0/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 298, in all_gather_into_tensor
    return cdb.all_gather_into_tensor(output_tensor=output_tensor, input_tensor=tensor, group=group, async_op=async_op)
  File "/data/home/miniconda3/envs/torch-2.0/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 125, in all_gather_into_tensor
    return self.all_gather_function(output_tensor=output_tensor,
  File "/data/home/miniconda3/envs/torch-2.0/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1436, in wrapper
    return func(*args, **kwargs)
  File "/data/home/miniconda3/envs/torch-2.0/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2517, in all_gather_into_tensor
    work = group._allgather_base(output_tensor, input_tensor)
RuntimeError: output tensor must have the same type as input tensor

I printed the dtype of input_tensor and output_tensor by injecting a print at https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/comm/comm.py#L312: the dtype of output_tensor is torch.float64 and the dtype of input_tensor is torch.float32. If I use a single GPU this error does not occur, but I run out of GPU memory instead.
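As a minimal illustration of the underlying requirement (the helper name below is hypothetical and not part of DeepSpeed's code): torch.distributed.all_gather_into_tensor expects the output buffer to have the same dtype as the input shard, which is exactly the check that fails in the traceback above.

import torch

def check_allgather_dtypes(output_tensor: torch.Tensor, input_tensor: torch.Tensor) -> None:
    # all_gather_into_tensor requires output and input to share a dtype; a mismatch
    # surfaces as "RuntimeError: output tensor must have the same type as input tensor".
    if output_tensor.dtype != input_tensor.dtype:
        raise RuntimeError(
            "dtype mismatch before all_gather_into_tensor: "
            f"output={output_tensor.dtype}, input={input_tensor.dtype}")

# The mismatch reported above: a float64 output buffer vs. a float32 input shard.
out = torch.empty(8, dtype=torch.float64)
inp = torch.ones(1, dtype=torch.float32)
try:
    check_allgather_dtypes(out, inp)
except RuntimeError as err:
    print(err)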

DogeWatch avatar Jun 01 '23 06:06 DogeWatch

@DogeWatch the format for enabling bf16 in your config is incorrect; it should be "bf16": { "enabled": true }.
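In the config above, that means replacing the "bf_16"/"enable" keys with the following fragment (the rest of the config stays as posted):

"bf16": {
    "enabled": true
},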

jomayeri avatar Jun 02 '23 18:06 jomayeri

@DogeWatch the format for enabling bf16 in your config is incorrect; it should be "bf16": { "enabled": true }.

@jomayeri sorry, that was my mistake, but with "bf16" the error still occurs.

DogeWatch avatar Jun 05 '23 11:06 DogeWatch

I have the same error. Any workaround?

flymin avatar Jun 23 '23 09:06 flymin

I resolved the issue by changing the Trainer argument from --bf16 to --fp16. I'm currently using PyTorch 2.0 together with DeepSpeed, and I've only come across this problem with that combination. The problem was a mismatched data type between input_tensor and output_tensor in DeepSpeed's runtime/zero/partition_parameters.py, in the function _dist_allgather_fn: one was torch.float32 and the other torch.float16. I'm unsure whether PyTorch 2.0 is the root cause of this issue. By any chance, have you also used PyTorch 2.0?

Here's the related piece of code:

def _dist_allgather_fn(input_tensor: Tensor, output_tensor: Tensor, group=None):
    # print(input_tensor.dtype, output_tensor.dtype)
    return instrument_w_nvtx(dist.allgather_fn)(output_tensor, input_tensor, group=group, async_op=True)

ShadowTeamCN avatar Jul 07 '23 15:07 ShadowTeamCN

@ShadowTeamCN or @DogeWatch do you have a script I can use to reproduce this error with torch 2.0?

jomayeri avatar Jul 10 '23 20:07 jomayeri

I resolved the issue by changing the Trainer argument from --bf16 to --fp16. I'm currently using PyTorch 2.0 together with DeepSpeed, and I've only come across this problem with that combination. The problem was a mismatched data type between input_tensor and output_tensor in DeepSpeed's runtime/zero/partition_parameters.py, in the function _dist_allgather_fn: one was torch.float32 and the other torch.float16. I'm unsure whether PyTorch 2.0 is the root cause of this issue. By any chance, have you also used PyTorch 2.0?

Here's the related piece of code:

def _dist_allgather_fn(input_tensor: Tensor, output_tensor: Tensor, group=None):
    # print(input_tensor.dtype, output_tensor.dtype)
    return instrument_w_nvtx(dist.allgather_fn)(output_tensor, input_tensor, group=group, async_op=True)

Switching to --fp16 works in my environment:

  • torch: 1.13.1
  • GPU: 2*8 A800-80G
  • deepspeed: 0.10.0
  • stage3 + zero++
  • deepspeed config
{
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "fp16": {"enabled": true},
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "last_batch_iteration": -1,
      "total_num_steps": "auto",
      "warmup_min_lr": 5e-7,
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": true},
        "offload_optimizer": {"device": "cpu", "pin_memory": true},
        "allgather_partitions": true,
        "allgather_bucket_size": 5e8,
        "contiguous_gradients": true,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 5e8,

        "zero_quantized_weights": true,
        "zero_hpz_partition_size": 8,
        "zero_quantized_gradients": true
    }
}

Updates:

  • --bf16 works after changing zero_quantized_weights to false (see the fragment below)
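If I read that update correctly, the working bf16 variant of the config above would differ roughly as in the following fragment; the precision section presumably switches from fp16 to bf16 to match the --bf16 flag, and the elided keys stay as posted:

"bf16": {"enabled": true},
"zero_optimization": {
    ...
    "zero_quantized_weights": false,
    "zero_hpz_partition_size": 8,
    "zero_quantized_gradients": true
}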

Sanster avatar Jul 26 '23 04:07 Sanster

Appears resolved with the correct flag.

jomayeri avatar Aug 11 '23 03:08 jomayeri

Same issue here! --bf16 yields the error while --fp16 does not.

namespace-Pt avatar Aug 15 '23 07:08 namespace-Pt

Appears resolved with the correct flag.

Has the problem been resolved?

vip-china avatar Jan 20 '24 03:01 vip-china

After changing the config following @Sanster, it works well with the llama-7b model using bf16, but it doesn't work with mistralai/Mixtral-8x7B-Instruct-v0.1. Any solution?

JY-CCK avatar Feb 01 '24 06:02 JY-CCK

I have the same issue with QLoRA finetuning for Mixtral. Using @Sanster's config, it was running, and then all GPUs hung with 100% of memory used without proceeding.

duyvuleo avatar Feb 05 '24 00:02 duyvuleo

#5049

duyvuleo avatar Feb 05 '24 11:02 duyvuleo