
Running Qwen2-VL-2B-Instruct on TGI is giving an error

Open ashwani-bhat opened this issue 9 months ago • 0 comments

System Info

docker run --gpus all --shm-size 1g -p 8080:80 -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
ghcr.io/huggingface/text-generation-inference:2.4.1 \
--model-id Qwen/Qwen2-VL-2B-Instruct --trust-remote-code \
--quantize bitsandbytes-nf4 --cuda-graphs 0

The above command gives the following error:


Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/cli.py", line 117, in serve
    server.serve(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 315, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
  File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 268, in serve_inner
    model = get_model_with_lora_adapters(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/__init__.py", line 1336, in get_model_with_lora_adapters
    model = get_model(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/__init__.py", line 1184, in get_model
    return VlmCausalLM(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/vlm_causal_lm.py", line 290, in __init__
    super().__init__(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 1287, in __init__
    model = model_class(prefix, config, weights)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/qwen2_vl.py", line 392, in __init__
    self.text_model = Qwen2Model(prefix=None, config=config, weights=weights)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 290, in __init__
    [
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 291, in <listcomp>
    Qwen2Layer(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 227, in __init__
    self.self_attn = Qwen2Attention(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 101, in __init__
    self.num_groups = self.num_heads // self.num_key_value_heads
ZeroDivisionError: integer division or modulo by zero

When I run the same command with the 7B model (Qwen/Qwen2-VL-7B-Instruct), it works fine.
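
My guess (not verified) is that this comes from the tensor-parallel sharding math: TGI appears to shard across all visible GPUs by default, and the 2B model's KV head count may be too small to split 4 ways. Below is a minimal sketch of the failing arithmetic, assuming 12 attention heads / 2 KV heads for the 2B model and 28 / 4 for the 7B model (values taken from the public configs, so treat them as assumptions):

world_size = 4  # assumption: one shard per visible GPU (CUDA_VISIBLE_DEVICES=0,1,2,3)

for name, num_heads, num_kv_heads in [
    ("Qwen2-VL-2B-Instruct", 12, 2),  # assumed head counts
    ("Qwen2-VL-7B-Instruct", 28, 4),  # assumed head counts
]:
    heads_per_shard = num_heads // world_size
    kv_heads_per_shard = num_kv_heads // world_size  # 2 // 4 == 0 for the 2B model
    try:
        # mirrors flash_qwen2_modeling.py line 101: num_heads // num_key_value_heads
        print(name, "num_groups =", heads_per_shard // kv_heads_per_shard)
    except ZeroDivisionError:
        print(name, "ZeroDivisionError: KV heads per shard rounded down to 0")

If that reading is right, it would explain why the 7B model works (4 KV heads // 4 shards == 1) while the 2B model fails (2 // 4 == 0).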

Information

  • [x] Docker
  • [ ] The CLI directly

Tasks

  • [x] An officially supported command
  • [ ] My own modifications

Reproduction

docker run --gpus all --shm-size 1g -p 8080:80 -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
ghcr.io/huggingface/text-generation-inference:2.4.1 \
--model-id Qwen/Qwen2-VL-2B-Instruct --trust-remote-code \
--quantize bitsandbytes-nf4 --cuda-graphs 0
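
If the KV-head rounding described above is really the cause, a possible workaround (untested) is to run the 2B model unsharded, or with at most 2 shards so each shard keeps at least one KV head, e.g.:

docker run --gpus all --shm-size 1g -p 8080:80 -e CUDA_VISIBLE_DEVICES=0 \
ghcr.io/huggingface/text-generation-inference:2.4.1 \
--model-id Qwen/Qwen2-VL-2B-Instruct --trust-remote-code \
--quantize bitsandbytes-nf4 --cuda-graphs 0 --num-shard 1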

Expected behavior

The server should start and serve Qwen/Qwen2-VL-2B-Instruct without errors, just as it does for the 7B model.

ashwani-bhat · Jan 27 '25 09:01