
Error when invoking the `chat/completions` API on a Qwen model via DJL

Open · massi-ang opened this issue 9 months ago · 0 comments

Description

I am serving a Qwen model via the DJL container:

docker run --runtime nvidia --gpus all \
    -p 8080:8080 \
    -v /home/ubuntu/s1/ckpts/s1-20250213_094556:/opt/ml/model \
    --ipc=host \
    deepjavalibrary/djl-serving:0.32.0-pytorch-gpu
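
If the goal is for the endpoint to accept the OpenAI-style sampling parameters that trigger the error reported below, the DJL LMI docs describe selecting a rolling-batch backend instead of the plain Hugging Face pipeline handler, via a serving.properties file in the model directory. A minimal sketch, assuming these option names (taken from the LMI configuration docs) apply to the 0.32.0 container and that vLLM supports this checkpoint:

# serving.properties, placed next to the model artifacts in /opt/ml/model
engine=Python
# Assumed fix: a rolling-batch backend such as vLLM understands
# frequency_penalty / presence_penalty / ignore_eos, while the plain
# Hugging Face generate() path does not.
option.rolling_batch=vllm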

I use the OpenAI client library to run an inference against the model with the following code:

import openai

client = openai.OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="None")
response = client.chat.completions.create(
    model="model",
    messages=[
        {"role": "user", "content": "how much is 2+2?"},
    ],
    temperature=0.6,
    max_tokens=4096,
)
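
For completeness, a hedged variant of the client code, not in the original report, that prints the server's error body instead of raising (openai.APIStatusError is the exception the v1 client raises for non-2xx responses):

import openai

client = openai.OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="None")
try:
    response = client.chat.completions.create(
        model="model",
        messages=[{"role": "user", "content": "how much is 2+2?"}],
        temperature=0.6,
        max_tokens=4096,
    )
    print(response.choices[0].message.content)
except openai.APIStatusError as e:
    # For this issue, e.status_code is 424 and e.message carries the
    # handler error shown in the Error Message section below.
    print(e.status_code, e.message)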

Expected Behavior

I get an answer from the model.

Error Message

In the client:

APIStatusError: Error code: 424 - {'code': 424, 'message': 'invoke handler failure', 'error': "The following `model_kwargs` are not used by the model: ['frequency_penalty', 'presence_penalty', 'ignore_eos'] (note: typos in the generate arguments will also show up in this list)"}

In the container logs:

INFO  PyProcess W-123-model-stdout: ERROR::Failed invoke service.invoke_handler()
INFO  PyProcess W-123-model-stdout: Traceback (most recent call last):
INFO  PyProcess W-123-model-stdout:   File "/tmp/.djl.ai/python/0.32.0-SNAPSHOT/djl_python_engine.py", line 161, in run_server
INFO  PyProcess W-123-model-stdout:     outputs = self.service.invoke_handler(function_name, inputs)
INFO  PyProcess W-123-model-stdout:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO  PyProcess W-123-model-stdout:   File "/tmp/.djl.ai/python/0.32.0-SNAPSHOT/djl_python/service_loader.py", line 30, in invoke_handler
INFO  PyProcess W-123-model-stdout:     return getattr(self.module, function_name)(inputs)
INFO  PyProcess W-123-model-stdout:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO  PyProcess W-123-model-stdout:   File "/tmp/.djl.ai/python/0.32.0-SNAPSHOT/djl_python/huggingface.py", line 660, in handle
INFO  PyProcess W-123-model-stdout:     return _service.inference(inputs)
INFO  PyProcess W-123-model-stdout:            ^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO  PyProcess W-123-model-stdout:   File "/tmp/.djl.ai/python/0.32.0-SNAPSHOT/djl_python/huggingface.py", line 242, in inference
INFO  PyProcess W-123-model-stdout:     return self._dynamic_batch_inference(parsed_input.batch, errors,
INFO  PyProcess W-123-model-stdout:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO  PyProcess W-123-model-stdout:   File "/tmp/.djl.ai/python/0.32.0-SNAPSHOT/djl_python/huggingface.py", line 257, in _dynamic_batch_inference
INFO  PyProcess W-123-model-stdout:     prediction = self.hf_pipeline(input_data, **parameters)
INFO  PyProcess W-123-model-stdout:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO  PyProcess W-123-model-stdout:   File "/tmp/.djl.ai/python/0.32.0-SNAPSHOT/djl_python/huggingface.py", line 458, in wrapped_pipeline
INFO  PyProcess W-123-model-stdout:     output_tokens = model.generate(
INFO  PyProcess W-123-model-stdout:                     ^^^^^^^^^^^^^^^
INFO  PyProcess W-123-model-stdout:   File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
INFO  PyProcess W-123-model-stdout:     return func(*args, **kwargs)
INFO  PyProcess W-123-model-stdout:            ^^^^^^^^^^^^^^^^^^^^^
INFO  PyProcess W-123-model-stdout:   File "/usr/local/lib/python3.11/dist-packages/transformers/generation/utils.py", line 2012, in generate
INFO  PyProcess W-123-model-stdout:     self._validate_model_kwargs(model_kwargs.copy())
INFO  PyProcess W-123-model-stdout:   File "/usr/local/lib/python3.11/dist-packages/transformers/generation/utils.py", line 1388, in _validate_model_kwargs
INFO  PyProcess W-123-model-stdout:     raise ValueError(
INFO  PyProcess W-123-model-stdout: ValueError: The following `model_kwargs` are not used by the model: ['frequency_penalty', 'presence_penalty', 'ignore_eos'] (note: typos in the generate arguments will also show up in this list)
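
The ValueError originates in transformers' generate-kwargs validation, not in DJL itself: the handler forwards OpenAI/vLLM-style sampling parameters to model.generate(), which rejects any kwargs it does not recognize. A minimal sketch that reproduces the same error outside the container (the checkpoint name is illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-1.5B-Instruct"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("how much is 2+2?", return_tensors="pt")
# These are OpenAI/vLLM-style sampling parameters; Hugging Face
# generate() does not accept them, so _validate_model_kwargs raises
# the same ValueError seen in the container logs above.
model.generate(
    **inputs,
    frequency_penalty=0.0,
    presence_penalty=0.0,
    ignore_eos=False,
)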

How to Reproduce?

The code above can be used to reproduce the issue. You can use a 1.5B Qwen2.5 Instruct model from Hugging Face.

Steps to reproduce

  1. Download the model.
  2. Host the model via the DJL container.
  3. Use the OpenAI client to run an inference (or send the request directly to the endpoint, as in the sketch below).
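
To rule out the client library, the same request can be sent straight to the OpenAI-compatible endpoint. A minimal sketch using requests (the path follows the standard chat/completions convention the client targets):

import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "model": "model",
        "messages": [{"role": "user", "content": "how much is 2+2?"}],
        "temperature": 0.6,
        "max_tokens": 4096,
    },
)
# On the failing setup this prints 424 and the handler error body.
print(resp.status_code, resp.text)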

What have you tried to solve it?

Environment Info

N/A. DJL runs inside the official container on a g6e.12xlarge EC2 instance.

massi-ang · Feb 18 '25