DeepSpeed
[BUG] DeepSpeed ZeRO-3 inference in-flight params with new Hugging Face Mixtral model
Describe the bug
I tried running DeepSpeed ZeRO-3 inference on the new Hugging Face Mixtral model and got the following error:
[2023-12-13 04:12:18,837] [WARNING] [parameter_offload.py:86:_apply_to_tensors_only] A module has unknown inputs or outputs type (<class 'transformers.cache_utils.DynamicCache'>) and the tensors embedded in it cannot be detected. The ZeRO-3 hooks designed to trigger before or after backward pass of the module relies on knowing the input and output tensors and therefore may not get triggered properly.
Invalidate trace cache @ step 14: expected module 19, but got module 34
Traceback (most recent call last):
File "/home/ubuntu/mixtral_hf/deepspeed_zero.py", line 36, in <module>
outputs = model.generate(inputs, max_new_tokens=20)
File "/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/mixtral_hf/transformers/src/transformers/generation/utils.py", line 1731, in generate
return self.greedy_search(
File "/home/ubuntu/mixtral_hf/transformers/src/transformers/generation/utils.py", line 2592, in greedy_search
outputs = self(
File "/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1581, in _call_impl
hook_result = hook(self, args, result)
File "/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 350, in _end_of_forward_hook
self.get_param_coordinator(training=False).reset_step()
File "/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 203, in reset_step
raise RuntimeError(f"still have inflight params "
RuntimeError: still have inflight params [{'id': 9, 'status': 'AVAILABLE', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 11, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 10, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (4096, 14336), 'ds_shape': (4096, 14336), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 15, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 17, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 16, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (4096, 14336), 'ds_shape': (4096, 14336), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 21, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 23, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 22, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (4096, 14336), 'ds_shape': (4096, 14336), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 27, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}]
To Reproduce
Steps to reproduce the behavior: run this simple inference script:
```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.integrations import HfDeepSpeedConfig

model_id = "mistralai/Mixtral-8x7B-v0.1"
ds_config = {
    "bf16": {
        "enabled": True,
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
        },
    },
    "train_micro_batch_size_per_gpu": 1,
}
# Must be created (and kept alive) before from_pretrained so the model is
# loaded directly into ZeRO-3 partitioned parameters.
hfdsc = HfDeepSpeedConfig(ds_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()
model = ds_engine.module
inputs = tokenizer.encode("DeepSpeed is", return_tensors="pt").to("cuda")
outputs = model.generate(inputs, max_new_tokens=20)
output_str = tokenizer.decode(outputs[0])
```
What packages are required and their versions
- HuggingFace 4.65
- Deepspeed 0.12.4
- Torch 2.1
- Cuda 12.1
ds_report output
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/torch']
torch version .................... 2.1.1
deepspeed install path ........... ['/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.12.4, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
shared memory (/dev/shm) size .... 124.52 GB
System info (please complete the following information):
- AWS g5.16xlarge instance
- OS: Ubuntu 22.04
- GPU: NVIDIA A10G
- GPU count: 1
- Python version: 3.10.13
same question
Same problem
same problem
Same problem. It's similar to https://github.com/microsoft/DeepSpeed/issues/4094.
- I modified num_experts_per_tok to 8 in the config.json of mixtral-8x7b and everything works well, so I think the problem is caused by the offload of unused parameters.
- I tried "stage3_prefetch_bucket_size": 0 as described in https://github.com/microsoft/DeepSpeed/issues/4094. After that it does not report any error, but it prints some warnings:
[2023-12-19 19:41:54,820] [WARNING] [parameter_offload.py:86:_apply_to_tensors_only] A module has unknown inputs or outputs type (<class 'transformers.cache_utils.DynamicCache'>) and the tensors embedded in it cannot be detected. The ZeRO-3 hooks designed to trigger before or after backward pass of the module relies on knowing the input and output tensors and therefore may not get triggered properly.
Invalidate trace cache @ step 14: expected module 19, but got module 34
Invalidate trace cache @ step 19: expected module 44, but got module 34
Invalidate trace cache @ step 56: expected module 133, but got module 123
Invalidate trace cache @ step 14: expected module 14, but got module 24
Invalidate trace cache @ step 14: expected module 34, but got module 14
Invalidate trace cache @ step 14: expected module 29, but got module 14
Invalidate trace cache @ step 14: expected module 14, but got module 29
Invalidate trace cache @ step 14: expected module 14, but got module 29
Invalidate trace cache @ step 14: expected module 29, but got module 19
Invalidate trace cache @ step 61: expected module 148, but got module 133
<s> DeepSpeed is a deep learning optimization library that makes distributed deep learning fast and efficient. DeepSpeed is designed to be
Changing that parameter fundamentally changes the model, right? By default, it should only route to 2 experts per token.
Yes
Will that impact the performance?
Of course :)
If I understand correctly, the inference speed will slow down; will the model's output quality also deteriorate?
same question
The first solution may degrade quality, as many have suspected. The second solution works, but inference runs extremely slowly, with a stream of warnings like "Invalidate trace cache @ step 14: expected module 14, but got module xxx".
Guys, thanks for the great debugging and collaboration here to understand this problem. The fundamental issue is that ZeRO-3 caches the parameter trace to enable parameter prefetching, which reduces all-gather latency. Unfortunately, since MoE layers can activate different experts across iterations, the parameter trace cache is invalidated whenever the activated experts change. The warning messages report these trace cache invalidations. In this case the warning is avoidable, since prefetching is disabled by setting "stage3_prefetch_bucket_size": 0, so only a minor fix is needed. In general, however, inference will be very slow, as observed.
We have not previously tested zero3 and MoE, but we will prioritize this investigation now given the interest.
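For reference, the workaround discussed above amounts to taking the ZeRO-3 config from the repro script and zeroing the prefetch bucket. A minimal sketch (same fields as the repro; not a recommended setting, given the slowdown just described):

```python
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu"},
        # Disabling prefetching sidesteps the in-flight-params error for MoE,
        # at the cost of overlapping all-gathers with compute (hence the slowdown).
        "stage3_prefetch_bucket_size": 0,
    },
    "train_micro_batch_size_per_gpu": 1,
}
```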
I got this error with "stage3_prefetch_bucket_size": 0 and ZeRO-3:
Invalidate trace cache @ step 1323: expected module 2476, but got module 2510 | 20/2466 [02:15<4:12:07, 6.18s/it, gpt_loss=1.28, loss_mean=1.22, balancing_loss=8]
[rank0]:[E ProcessGroupNCCL.cpp:754] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=151178, OpType=_ALLGATHER_BASE, NumelIn=65536, NumelOut=262144, Timeout(ms)=1800000) ran for 1800326 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:768] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:774] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1282] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=151178, OpType=_ALLGATHER_BASE, NumelIn=65536, NumelOut=262144, Timeout(ms)=1800000) ran for 1800326 milliseconds before timing out.
Exception raised from checkTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:756 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x99 (0x7f9ddd19c8f9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(c10::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1f2 (0x7f9d7ef58142 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x178 (0x7f9d7ef5e538 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x8e (0x7f9d7ef5eb2e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdc253 (0x7f9ddccb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7f9e8ad78ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126660 (0x7f9e8ae0a660 in /usr/lib/x86_64-linux-gnu/libc.so.6)
I am also observing the same issue even with "stage3_prefetch_bucket_size": 0. The runtime error about in-flight parameters does not occur, but the process just hangs indefinitely and eventually crashes with a timeout.
Did anyone manage to fine-tune Mixtral with ZeRO-3 and Hugging Face? Could you share your deepspeed config? @K-Nick @LZHgrla @ryandeng1
@BBerabi You can try it with xtuner: https://github.com/InternLM/xtuner/tree/main/xtuner/configs/mixtral
But remember to use deepspeed_zero3 instead of deepspeed_zero3_offload.
I can fully fine-tune Mixtral 8x7B Instruct (Mistral 7B x 8) with DeepSpeed ZeRO-3 on 2 A100-80GB instances; the code does not hang and runs smoothly. I didn't change anything except disabling the evaluation part that computes perplexity on the validation set. The fine-tuned model looks normal, but I still don't know why it works in my case. I'm just providing my training environment for your reference. Transformers version: 4.36.2, DeepSpeed 0.12.5, DeepSpeed ZeRO-3 config:
"gradient_accumulation_steps": 8,
"train_micro_batch_size_per_gpu": 4,
"prescale_gradients": false,
"zero_allow_untested_optimizer": true,
"zero_optimization": {
"stage": 3,
"offload_param": {
"device": "none"
},
"offload_optimizer": {
"device": "none"
},
"stage3_param_persistence_threshold": 1.000000e+04,
"stage3_max_live_parameters": 3.000000e+07,
"stage3_prefetch_bucket_size": 3.000000e+07,
"memory_efficient_linear": false
},
"steps_per_print": 1,
"gradient_clipping": 1.0,
"wall_clock_breakdown": true,
"bf16": {
"enabled": true
}
}```
I have the same question... how can it be resolved?
Hi all, if you want to generate text with Mixtral, DeepSpeed-FastGen (DeepSpeed-MII) should be your first choice. The example is available here; I verified that Mixtral works just by changing the model name.
It is easier to use "non-persistent" mode for testing purposes, but "persistent" mode will give you the best performance. Please refer to DeepSpeed-MII for more details.
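For anyone who has not used MII, here is a minimal sketch of the non-persistent mode, following the usual mii.pipeline pattern with the model name from the repro above (adjust the prompt and max_new_tokens as needed):

```python
import mii

# Non-persistent mode: the engine lives only for the duration of this script.
pipe = mii.pipeline("mistralai/Mixtral-8x7B-v0.1")
response = pipe(["DeepSpeed is"], max_new_tokens=20)
print(response)
```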
Is there any progress?
Hi @hijkzzz and all, #4966 should have fixed this issue. You can find a working example there. The PR has already been merged into master. Please feel free to try it, but I still recommend using DeepSpeed-FastGen for text generation; it is much faster and supports Mixtral.
Hey @mynewstart, would you mind sharing your complete deepspeed config?
@tohtana In my testing of the Mixtral fine-tuning phase with ZeRO-3, the training process hung at step 5 on the same dataset. This patch does not seem to fix my hang during training; as you stated, it fixes the text generation issue with ZeRO-3.
After some debugging, I found the hang is probably related to these lines in the MixtralSparseMoeBlock implementation; the hang happens when some experts are assigned no tokens in the training batch. https://github.com/huggingface/transformers/blob/e547458c43dfdbbb8f6a7757237e234c44e20a8f/src/transformers/models/mixtral/modeling_mixtral.py#L823-L824
Could you please explain why this implementation causes a hang with ZeRO-3? (ZeRO-2 runs normally.) Thanks for your reply.
Thank you for sharing the issue, @ftgreat. The same issue is reported at #4966. Let me take a look.
@tohtana I wrote a monkey patch that uses a dense MoE implementation instead of the Mixtral sparse MoE. It tested OK for my cases; no hangs happened. https://github.com/ftgreat/llmkit/blob/main/huggingface/mixtral/mixtral_dense_moe_monkey_patch.py
I would still like a detailed explanation of the cause with the sparse MoE implementation. Thanks.
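For context, here is a rough, self-contained sketch of the dense-MoE idea (not the linked patch itself; DenseMoEBlock and its sizes are made up for illustration): every expert runs on every token, and the router weights, zeroed outside the top-k, select the output, so all expert parameters participate in every forward and backward pass.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseMoEBlock(nn.Module):
    """Illustrative dense top-k MoE: all experts run, routing weights gate the outputs."""

    def __init__(self, hidden_size=64, ffn_size=128, n_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_size, ffn_size), nn.SiLU(), nn.Linear(ffn_size, hidden_size))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (batch, seq, hidden)
        probs = F.softmax(self.gate(x), dim=-1)               # (batch, seq, n_experts)
        topk_vals, topk_idx = probs.topk(self.top_k, dim=-1)  # keep only the top-k weights
        weights = torch.zeros_like(probs).scatter(-1, topk_idx, topk_vals)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over selected experts
        # Dense evaluation: every expert sees every token; the weights pick the outputs.
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            out = out + weights[..., e:e + 1] * expert(x)
        return out


if __name__ == "__main__":
    block = DenseMoEBlock()
    print(block(torch.randn(2, 5, 64)).shape)  # torch.Size([2, 5, 64])
```

The trade-off: the dense form avoids the missing-hook problem because no expert is ever skipped, but it pays the compute cost of running all experts on every token.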
@ftgreat The root cause of this issue is that DeepSpeed tries to run reduce-scatter for only a subset of the experts.
ZeRO-3 sets hooks on parameters to run reduce-scatter. However, a hook is not fired unless its expert is activated in the forward pass, and our data-parallel processes may activate different sets of experts. All processes need to join such a communication collective, but in this case reduce-scatter is called on only some of them.
Since we already implemented an API to set a leaf module for ZeRO-3, the solution will be to delay reduce-scatter until the backward pass of the leaf module finishes. I will work in this direction.
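For anyone who wants to try the leaf-module route mentioned here, a minimal sketch, assuming the set_z3_leaf_modules utility introduced around #4966 (the config below is a stripped-down ZeRO-3 config for illustration, not a tuned one):

```python
import torch
import deepspeed
from deepspeed.utils import set_z3_leaf_modules
from transformers import AutoModelForCausalLM
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": 1,
}

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1", torch_dtype=torch.bfloat16
)
# Mark each sparse-MoE block as a ZeRO-3 "leaf" so its parameters are handled
# as one unit instead of relying on per-expert hooks that may never fire.
set_z3_leaf_modules(model, [MixtralSparseMoeBlock])

engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
```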
@ftgreat Hello, I would like to know whether your monkey patch achieves the same results as the original Mixtral forward. Is this method currently the best approach?
https://github.com/ftgreat/llmkit/blob/main/huggingface/mixtral/mixtral_dense_moe_monkey_patch.py
I added some unit test cases: https://github.com/ftgreat/llmkit/blob/main/huggingface/mixtral/mixtral_dense_moe_monkey_patch.py#L69
I've opened a PR (#5008) to fix the issue causing hangs during backward passes. Please feel free to test it with your model.
I tried the code and it runs without error, but the loss is all 0 and the grad is 1?
{'TFlops': 0.0, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 2.5e-08, 'epoch': 0.0}
{'TFlops': 0.0, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 2.5e-08, 'epoch': 0.0}
{'TFlops': 0.0, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 2.5e-08, 'epoch': 0.0}
{'TFlops': 0.0, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 2.5e-08, 'epoch': 0.0}
{'TFlops': 0.0, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 2.5e-08, 'epoch': 0.0}
{'TFlops': 0.0, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 2.5e-08, 'epoch': 0.0}
{'TFlops': 0.15630008137498733, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 5e-08, 'epoch': 0.0}
{'TFlops': 0.15629951739661085, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 5e-08, 'epoch': 0.0}
{'TFlops': 0.15629945559100883, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 5e-08, 'epoch': 0.0}
{'TFlops': 0.15629979895608306, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 5e-08, 'epoch': 0.0}
{'TFlops': 0.1562995422905477, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 5e-08, 'epoch': 0.0}
{'TFlops': 0.15630035091860164, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 5e-08, 'epoch': 0.0}
{'TFlops': 0.3167301641335448, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 7.500000000000001e-08, 'epoch': 0.0}
{'TFlops': 0.31672936396061685, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 7.500000000000001e-08, 'epoch': 0.0}
{'TFlops': 0.3167279821737141, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 7.500000000000001e-08, 'epoch': 0.0}
{'TFlops': 0.31672982925892, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 7.500000000000001e-08, 'epoch': 0.0}
{'TFlops': 0.3167278887625346, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 7.500000000000001e-08, 'epoch': 0.0}
{'TFlops': 0.31672019927269446, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 7.500000000000001e-08, 'epoch': 0.0}
{'TFlops': 0.3167748692022585, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 1e-07, 'epoch': 0.0}
{'TFlops': 0.31677583885383076, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 1e-07, 'epoch': 0.0}
{'TFlops': 0.31677668333703635, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 1e-07, 'epoch': 0.0}
{'TFlops': 0.3167755179866533, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 1e-07, 'epoch': 0.0}
{'TFlops': 0.31677014443953183, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 1e-07, 'epoch': 0.0}
{'TFlops': 0.31678280112644813, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 1e-07, 'epoch': 0.0}
Any update on this?