DeepSpeed
[BUG] DeepSpeed ZeRO-3 inference in-flight params with new Hugging Face Mixtral model
Describe the bug
I tried running DeepSpeed ZeRO-3 inference on the new Hugging Face Mixtral model and got the following error:
[2023-12-13 04:12:18,837] [WARNING] [parameter_offload.py:86:_apply_to_tensors_only] A module has unknown inputs or outputs type (<class 'transformers.cache_utils.DynamicCache'>) and the tensors embedded in it cannot be detected. The ZeRO-3 hooks designed to trigger before or after backward pass of the module relies on knowing the input and output tensors and therefore may not get triggered properly.
Invalidate trace cache @ step 14: expected module 19, but got module 34
Traceback (most recent call last):
File "/home/ubuntu/mixtral_hf/deepspeed_zero.py", line 36, in <module>
outputs = model.generate(inputs, max_new_tokens=20)
File "/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/mixtral_hf/transformers/src/transformers/generation/utils.py", line 1731, in generate
return self.greedy_search(
File "/home/ubuntu/mixtral_hf/transformers/src/transformers/generation/utils.py", line 2592, in greedy_search
outputs = self(
File "/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1581, in _call_impl
hook_result = hook(self, args, result)
File "/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 350, in _end_of_forward_hook
self.get_param_coordinator(training=False).reset_step()
File "/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 203, in reset_step
raise RuntimeError(f"still have inflight params "
RuntimeError: still have inflight params [{'id': 9, 'status': 'AVAILABLE', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 11, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 10, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (4096, 14336), 'ds_shape': (4096, 14336), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 15, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 17, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 16, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (4096, 14336), 'ds_shape': (4096, 14336), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 21, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 23, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 22, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (4096, 14336), 'ds_shape': (4096, 14336), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 27, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}]
To Reproduce
Steps to reproduce the behavior: run this simple inference script:
```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.integrations import HfDeepSpeedConfig

model_id = "mistralai/Mixtral-8x7B-v0.1"
ds_config = {
    "bf16": {
        "enabled": True,
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
        },
    },
    "train_micro_batch_size_per_gpu": 1,
}
# Must be created (and kept alive) before from_pretrained so the model is
# loaded directly into ZeRO-3 partitioned parameters.
hfdsc = HfDeepSpeedConfig(ds_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()
model = ds_engine.module
inputs = tokenizer.encode("DeepSpeed is", return_tensors="pt").to("cuda")
outputs = model.generate(inputs, max_new_tokens=20)
output_str = tokenizer.decode(outputs[0])
```
What packages are required and their versions
- HuggingFace 4.65
- Deepspeed 0.12.4
- Torch 2.1
- Cuda 12.1
ds_report output
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/torch']
torch version .................... 2.1.1
deepspeed install path ........... ['/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.12.4, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
shared memory (/dev/shm) size .... 124.52 GB
System info (please complete the following information):
- AWS g5.16xlarge instance
- OS: Ubuntu 22.04
- GPU: NVIDIA A10G
- GPU count: 1
- Python version: 3.10.13
same question
Same problem
same problem
Same problem. It's similar to https://github.com/microsoft/DeepSpeed/issues/4094.
- I modified num_experts_per_tok to 8 in the config.json of mixtral-8x7b and everything works well, so I think the problem is caused by the offload of unused parameters.
- I tried "stage3_prefetch_bucket_size": 0 as described in https://github.com/microsoft/DeepSpeed/issues/4094. After that it does not report any error, but it prints some warnings:
[2023-12-19 19:41:54,820] [WARNING] [parameter_offload.py:86:_apply_to_tensors_only] A module has unknown inputs or outputs type (<class 'transformers.cache_utils.DynamicCache'>) and the tensors embedded in it cannot be detected. The ZeRO-3 hooks designed to trigger before or after backward pass of the module relies on knowing the input and output tensors and therefore may not get triggered properly.
Invalidate trace cache @ step 14: expected module 19, but got module 34
Invalidate trace cache @ step 19: expected module 44, but got module 34
Invalidate trace cache @ step 56: expected module 133, but got module 123
Invalidate trace cache @ step 14: expected module 14, but got module 24
Invalidate trace cache @ step 14: expected module 34, but got module 14
Invalidate trace cache @ step 14: expected module 29, but got module 14
Invalidate trace cache @ step 14: expected module 14, but got module 29
Invalidate trace cache @ step 14: expected module 14, but got module 29
Invalidate trace cache @ step 14: expected module 29, but got module 19
Invalidate trace cache @ step 61: expected module 148, but got module 133
<s> DeepSpeed is a deep learning optimization library that makes distributed deep learning fast and efficient. DeepSpeed is designed to be
Changing that parameter fundamentally changes the model, right? By default, it should only route to 2 experts per token.
Yes
Will that impact the performance?
Of course :)
If I understand correctly, the inference speed will slow down; will the model's output quality also deteriorate?
same question
The first solution may degrade quality, as many have suspected. The second solution works, but inference runs extremely slowly, with a stream of warnings like "Invalidate trace cache @ step 14: expected module 14, but got module xxx".
Guys, thanks for the great debugging and collaboration here to understand this problem. The fundamental issue is that ZeRO-3 caches the parameter trace to enable parameter prefetching, which reduces all-gather latency. Unfortunately, since MoE layers can activate different experts across iterations, the parameter trace cache is invalidated whenever the activated experts change. The warning messages report these trace cache invalidations. In this case the warning is avoidable, since prefetching is disabled by setting "stage3_prefetch_bucket_size": 0, so only a minor fix is needed. In general, however, inference will be very slow, as observed.
We have not previously tested zero3 and MoE, but we will prioritize this investigation now given the interest.
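For reference, the workaround discussed above amounts to taking the ZeRO-3 config from the repro script and zeroing the prefetch bucket. A minimal sketch (same fields as the repro; not a recommended setting, given the slowdown just described):

```python
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu"},
        # Disabling prefetching sidesteps the in-flight-params error for MoE,
        # at the cost of overlapping all-gathers with compute (hence the slowdown).
        "stage3_prefetch_bucket_size": 0,
    },
    "train_micro_batch_size_per_gpu": 1,
}
```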
I got this error with "stage3_prefetch_bucket_size": 0 and ZeRO-3:
Invalidate trace cache @ step 1323: expected module 2476, but got module 2510 | 20/2466 [02:15<4:12:07, 6.18s/it, gpt_loss=1.28, loss_mean=1.22, balancing_loss=8]
[rank0]:[E ProcessGroupNCCL.cpp:754] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=151178, OpType=_ALLGATHER_BASE, NumelIn=65536, NumelOut=262144, Timeout(ms)=1800000) ran for 1800326 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:768] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:774] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1282] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=151178, OpType=_ALLGATHER_BASE, NumelIn=65536, NumelOut=262144, Timeout(ms)=1800000) ran for 1800326 milliseconds before timing out.
Exception raised from checkTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:756 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x99 (0x7f9ddd19c8f9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(c10::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1f2 (0x7f9d7ef58142 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x178 (0x7f9d7ef5e538 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x8e (0x7f9d7ef5eb2e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdc253 (0x7f9ddccb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7f9e8ad78ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126660 (0x7f9e8ae0a660 in /usr/lib/x86_64-linux-gnu/libc.so.6)
I am also observing the same issue even with "stage3_prefetch_bucket_size": 0. The runtime error about in-flight parameters does not occur, but the process just hangs indefinitely and eventually crashes with a timeout.
Did anyone manage to fine-tune Mixtral with ZeRO-3 and Hugging Face? Could you share your deepspeed config? @K-Nick @LZHgrla @ryandeng1
@BBerabi You can try it with xtuner: https://github.com/InternLM/xtuner/tree/main/xtuner/configs/mixtral
But remember to use deepspeed_zero3 instead of deepspeed_zero3_offload.
I can fully fine-tune Mixtral 8x7B Instruct (Mistral 7B x 8) with DeepSpeed ZeRO-3 on 2 A100-80GB instances; the code does not hang and runs smoothly. I didn't change anything except disabling the evaluation part that computes perplexity on the validation set. The fine-tuned model looks normal, but I still don't know why it works in my case. I'm just providing my training environment for your reference. Transformers version: 4.36.2, DeepSpeed 0.12.5, DeepSpeed ZeRO-3 config:
"gradient_accumulation_steps": 8,
"train_micro_batch_size_per_gpu": 4,
"prescale_gradients": false,
"zero_allow_untested_optimizer": true,
"zero_optimization": {
"stage": 3,
"offload_param": {
"device": "none"
},
"offload_optimizer": {
"device": "none"
},
"stage3_param_persistence_threshold": 1.000000e+04,
"stage3_max_live_parameters": 3.000000e+07,
"stage3_prefetch_bucket_size": 3.000000e+07,
"memory_efficient_linear": false
},
"steps_per_print": 1,
"gradient_clipping": 1.0,
"wall_clock_breakdown": true,
"bf16": {
"enabled": true
}
}```
I have the same question... how can it be resolved?
Hi all, if you want to generate text with Mixtral, DeepSpeed-FastGen (DeepSpeed-MII) should be your first choice. The example is available here; I verified that Mixtral works just by changing the model name.
It is easier to use "non-persistent" mode for testing purposes, but "persistent" mode will give you the best performance. Please refer to DeepSpeed-MII for more details.
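For anyone who has not used MII, here is a minimal sketch of the non-persistent mode, following the usual mii.pipeline pattern with the model name from the repro above (adjust the prompt and max_new_tokens as needed):

```python
import mii

# Non-persistent mode: the engine lives only for the duration of this script.
pipe = mii.pipeline("mistralai/Mixtral-8x7B-v0.1")
response = pipe(["DeepSpeed is"], max_new_tokens=20)
print(response)
```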
Is there any progress?
Hi @hijkzzz and all, #4966 should have fixed this issue. You can find a working example there. The PR has already been merged into master. Please feel free to try it, but I still recommend using DeepSpeed-FastGen for text generation; it is much faster and supports Mixtral.
Hey @mynewstart, would you mind sharing your complete deepspeed config?
@tohtana In my testing of the Mixtral fine-tuning phase with ZeRO-3, the training process hung at step 5 on the same dataset. This patch does not seem to fix my hang during training; as you stated, it fixes the text generation issue with ZeRO-3.
After some debugging, I found the hang is probably related to these lines in the MixtralSparseMoeBlock implementation; the hang happens when some experts are assigned no tokens in the training batch. https://github.com/huggingface/transformers/blob/e547458c43dfdbbb8f6a7757237e234c44e20a8f/src/transformers/models/mixtral/modeling_mixtral.py#L823-L824
Could you please explain why this implementation causes a hang with ZeRO-3? (ZeRO-2 runs normally.) Thanks for your reply.
Thank you for sharing the issue, @ftgreat. The same issue is reported at #4966. Let me take a look.
@tohtana I wrote a monkey patch that uses a dense MoE implementation instead of the Mixtral sparse MoE. It tested OK for my cases; no hangs happened. https://github.com/ftgreat/llmkit/blob/main/huggingface/mixtral/mixtral_dense_moe_monkey_patch.py
I would still like a detailed explanation of the cause with the sparse MoE implementation. Thanks.
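For context, here is a rough, self-contained sketch of the dense-MoE idea (not the linked patch itself; DenseMoEBlock and its sizes are made up for illustration): every expert runs on every token, and the router weights, zeroed outside the top-k, select the output, so all expert parameters participate in every forward and backward pass.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseMoEBlock(nn.Module):
    """Illustrative dense top-k MoE: all experts run, routing weights gate the outputs."""

    def __init__(self, hidden_size=64, ffn_size=128, n_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_size, ffn_size), nn.SiLU(), nn.Linear(ffn_size, hidden_size))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (batch, seq, hidden)
        probs = F.softmax(self.gate(x), dim=-1)               # (batch, seq, n_experts)
        topk_vals, topk_idx = probs.topk(self.top_k, dim=-1)  # keep only the top-k weights
        weights = torch.zeros_like(probs).scatter(-1, topk_idx, topk_vals)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over selected experts
        # Dense evaluation: every expert sees every token; the weights pick the outputs.
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            out = out + weights[..., e:e + 1] * expert(x)
        return out


if __name__ == "__main__":
    block = DenseMoEBlock()
    print(block(torch.randn(2, 5, 64)).shape)  # torch.Size([2, 5, 64])
```

The trade-off: the dense form avoids the missing-hook problem because no expert is ever skipped, but it pays the compute cost of running all experts on every token.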
@ftgreat The root cause of this issue is that DeepSpeed tries to run reduce-scatter for only a subset of the experts.
ZeRO-3 sets hooks on parameters to run reduce-scatter. However, a hook is not fired unless its expert is activated in the forward pass, and our data-parallel processes may activate different sets of experts. All processes need to join such a communication collective, but in this case reduce-scatter is called on only some of them.
Since we already implemented an API to set a leaf module for ZeRO-3, the solution will be to delay reduce-scatter until the backward pass of the leaf module finishes. I will work in this direction.
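For anyone who wants to try the leaf-module route mentioned here, a minimal sketch, assuming the set_z3_leaf_modules utility introduced around #4966 (the config below is a stripped-down ZeRO-3 config for illustration, not a tuned one):

```python
import torch
import deepspeed
from deepspeed.utils import set_z3_leaf_modules
from transformers import AutoModelForCausalLM
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": 1,
}

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1", torch_dtype=torch.bfloat16
)
# Mark each sparse-MoE block as a ZeRO-3 "leaf" so its parameters are handled
# as one unit instead of relying on per-expert hooks that may never fire.
set_z3_leaf_modules(model, [MixtralSparseMoeBlock])

engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
```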
@ftgreat Hello, I would like to know whether your monkey patch achieves the same results as the original Mixtral forward. Is this method currently the best approach?
https://github.com/ftgreat/llmkit/blob/main/huggingface/mixtral/mixtral_dense_moe_monkey_patch.py
I added some unit test cases: https://github.com/ftgreat/llmkit/blob/main/huggingface/mixtral/mixtral_dense_moe_monkey_patch.py#L69
I've opened a PR (#5008) to fix the issue causing hangs during backward passes. Please feel free to test it with your model.
I tried the code and it runs without error, but the loss is all 0 and the grad is 1?
{'TFlops': 0.0, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 2.5e-08, 'epoch': 0.0}
{'TFlops': 0.0, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 2.5e-08, 'epoch': 0.0}
{'TFlops': 0.0, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 2.5e-08, 'epoch': 0.0}
{'TFlops': 0.0, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 2.5e-08, 'epoch': 0.0}
{'TFlops': 0.0, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 2.5e-08, 'epoch': 0.0}
{'TFlops': 0.0, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 2.5e-08, 'epoch': 0.0}
{'TFlops': 0.15630008137498733, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 5e-08, 'epoch': 0.0}
{'TFlops': 0.15629951739661085, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 5e-08, 'epoch': 0.0}
{'TFlops': 0.15629945559100883, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 5e-08, 'epoch': 0.0}
{'TFlops': 0.15629979895608306, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 5e-08, 'epoch': 0.0}
{'TFlops': 0.1562995422905477, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 5e-08, 'epoch': 0.0}
{'TFlops': 0.15630035091860164, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 5e-08, 'epoch': 0.0}
{'TFlops': 0.3167301641335448, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 7.500000000000001e-08, 'epoch': 0.0}
{'TFlops': 0.31672936396061685, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 7.500000000000001e-08, 'epoch': 0.0}
{'TFlops': 0.3167279821737141, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 7.500000000000001e-08, 'epoch': 0.0}
{'TFlops': 0.31672982925892, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 7.500000000000001e-08, 'epoch': 0.0}
{'TFlops': 0.3167278887625346, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 7.500000000000001e-08, 'epoch': 0.0}
{'TFlops': 0.31672019927269446, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 7.500000000000001e-08, 'epoch': 0.0}
{'TFlops': 0.3167748692022585, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 1e-07, 'epoch': 0.0}
{'TFlops': 0.31677583885383076, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 1e-07, 'epoch': 0.0}
{'TFlops': 0.31677668333703635, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 1e-07, 'epoch': 0.0}
{'TFlops': 0.3167755179866533, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 1e-07, 'epoch': 0.0}
{'TFlops': 0.31677014443953183, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 1e-07, 'epoch': 0.0}
{'TFlops': 0.31678280112644813, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 1e-07, 'epoch': 0.0}
Any update on this?