Running Qwen2-VL-2B-Instruct on TGI is giving an error
System Info
docker run --gpus all --shm-size 1g -p 8080:80 -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
ghcr.io/huggingface/text-generation-inference:2.4.1 \
--model-id Qwen/Qwen2-VL-2B-Instruct --trust-remote-code \
--quantize bitsandbytes-nf4 --cuda-graphs 0
The above command fails with the following error:
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/cli.py", line 117, in serve
server.serve(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 315, in serve
asyncio.run(
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
self._run_once()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
handle._run()
File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 268, in serve_inner
model = get_model_with_lora_adapters(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/__init__.py", line 1336, in get_model_with_lora_adapters
model = get_model(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/__init__.py", line 1184, in get_model
return VlmCausalLM(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/vlm_causal_lm.py", line 290, in __init__
super().__init__(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 1287, in __init__
model = model_class(prefix, config, weights)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/qwen2_vl.py", line 392, in __init__
self.text_model = Qwen2Model(prefix=None, config=config, weights=weights)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 290, in __init__
[
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 291, in <listcomp>
Qwen2Layer(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 227, in __init__
self.self_attn = Qwen2Attention(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 101, in __init__
self.num_groups = self.num_heads // self.num_key_value_heads
ZeroDivisionError: integer division or modulo by zero
When I run the same command with the 7B model, it works fine.
Information
- [x] Docker
- [ ] The CLI directly
Tasks
- [x] An officially supported command
- [ ] My own modifications
Reproduction
docker run --gpus all --shm-size 1g -p 8080:80 -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
ghcr.io/huggingface/text-generation-inference:2.4.1 \
--model-id Qwen/Qwen2-VL-2B-Instruct --trust-remote-code \
--quantize bitsandbytes-nf4 --cuda-graphs 0
Expected behavior
The 2B model should load and serve requests successfully, just as the 7B model does with the same command.