
`ValueError: channels must be divisible by 8` when new special tokens are added

Open · s-jse opened this issue on Jan 22, 2024 · 3 comments

I can run the original LLaMA-2-7B model and its fine-tuned versions without any issues. However, if a special token was added during fine-tuning, the model cannot be loaded with MII, even though it works just fine with vLLM, Hugging Face transformers, and TGI. The same happens when testing Mistral-7B. The shortest code that reproduces the error is:

import mii
pipeline = mii.pipeline("stanford-oval/Llama-2-7b-WikiChat")
Traceback (most recent call last):
  File "/home/user1/llama/test.py", line 3, in <module>
    pipeline = mii.pipeline("./workdir/earlycombine_gpt4_fused_v3")
  File "/home/user1/anaconda3/envs/llama/lib/python3.10/site-packages/mii/api.py", line 156, in pipeline
    inference_engine = load_model(model_config)
  File "/home/user1/anaconda3/envs/llama/lib/python3.10/site-packages/mii/modeling/models.py", line 17, in load_model
    inference_engine = build_hf_engine(
  File "/home/user1/anaconda3/envs/llama/lib/python3.10/site-packages/deepspeed/inference/v2/engine_factory.py", line 126, in build_hf_engine
    return InferenceEngineV2(policy, engine_config)
  File "/home/user1/anaconda3/envs/llama/lib/python3.10/site-packages/deepspeed/inference/v2/engine_v2.py", line 83, in __init__
    self._model = self._policy.build_model(self._config, self._base_mp_group)
  File "/home/user1/anaconda3/envs/llama/lib/python3.10/site-packages/deepspeed/inference/v2/model_implementations/inference_policy_base.py", line 156, in build_model
    self.model = self.instantiate_model(engine_config, mp_group)
  File "/home/user1/anaconda3/envs/llama/lib/python3.10/site-packages/deepspeed/inference/v2/model_implementations/llama_v2/policy.py", line 17, in instantiate_model
    return Llama2InferenceModel(config=self._model_config, engine_config=engine_config, base_mp_group=mp_group)
  File "/home/user1/anaconda3/envs/llama/lib/python3.10/site-packages/deepspeed/inference/v2/model_implementations/inference_transformer_base.py", line 222, in __init__
    self.make_unembedding_layer()
  File "/home/user1/anaconda3/envs/llama/lib/python3.10/site-packages/deepspeed/inference/v2/model_implementations/inference_transformer_base.py", line 265, in make_unembedding_layer
    self.unembed = heuristics.instantiate_unembed(unembed_config, self._engine_config)
  File "/home/user1/anaconda3/envs/llama/lib/python3.10/site-packages/deepspeed/inference/v2/modules/heuristics.py", line 179, in instantiate_unembed
    return DSUnembedRegistry.instantiate_config(config)
  File "/home/user1/anaconda3/envs/llama/lib/python3.10/site-packages/deepspeed/inference/v2/modules/module_registry.py", line 39, in instantiate_config
    return cls.registry[config_bundle.name](config_bundle.config, config_bundle.implementation_config)
  File "/home/user1/anaconda3/envs/llama/lib/python3.10/site-packages/deepspeed/inference/v2/modules/implementations/unembed/ragged_unembed.py", line 69, in __init__
    self._act_fn = CUDABiasActivation(self._config.vocab_size, self._config.dtype, ActivationType.IDENTITY)
  File "/home/user1/anaconda3/envs/llama/lib/python3.10/site-packages/deepspeed/inference/v2/kernels/core_ops/bias_activations/bias_activation.py", line 36, in __init__
    raise ValueError("channels must be divisible by 8")
ValueError: channels must be divisible by 8
GPU: NVIDIA A100 
Python: 3.10.13
deepspeed==0.13.0
deepspeed-kernels==0.0.1.dev1698255861
deepspeed-mii==0.2.0
torch==2.1.2+cu118
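
The traceback shows that the channel count passed to CUDABiasActivation is the model's vocab_size, so the failing condition can be checked directly from the checkpoint config (a minimal sketch, assuming transformers is installed):

from transformers import AutoConfig

# Inspect the fine-tuned checkpoint's vocab size; the fused kernel
# requires this value to be divisible by 8.
config = AutoConfig.from_pretrained("stanford-oval/Llama-2-7b-WikiChat")
print(config.vocab_size, config.vocab_size % 8)  # non-zero remainder here, since a special token was added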

s-jse · Jan 22, 2024

Hi! Have you been able to resolve it?

No, I still get the same error.

s-jse · Jan 24, 2024

@s-jse thanks for reporting this issue! Currently, the DeepSpeed-FastGen fused bias and activation kernel requires the number of channels to be divisible by 8, since it takes advantage of vectorized instructions to achieve better performance.

Currently supported Llama models have a vocab size of 32000 (any vocab size divisible by 8 should work!). "stanford-oval/Llama-2-7b-WikiChat" (and any model with new special tokens added) has 32001 or more, which breaks our fused bias and activation kernel in the unembedding layer.
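
In the meantime, a possible workaround is to pad the embedding matrix to a multiple of 8 when adding special tokens, so the saved checkpoint's vocab size already satisfies the kernel's requirement. A sketch using transformers' resize_token_embeddings (the token name "<my_token>" and the output paths are just placeholders, and this workaround is not officially verified):

from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Add the new special token, then pad the embedding matrix so the
# resulting vocab size is a multiple of 8 (e.g. 32001 -> 32008).
tokenizer.add_special_tokens({"additional_special_tokens": ["<my_token>"]})
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)

model.save_pretrained("./llama-2-7b-padded")
tokenizer.save_pretrained("./llama-2-7b-padded")

The padding rows are never emitted by the tokenizer, so generation quality should be unaffected.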

We will generalize this kernel to work with arbitrary channel sizes soon and will let you know! Thanks!

arashb · Jan 25, 2024

Any updates on this issue?

kaonashi-tyc · Mar 15, 2024