Error when invoking the `chat/completions` API on a Qwen model via DJL
Description
I am serving a Qwen model with the DJL Serving container:

docker run --runtime nvidia --gpus all -p 8000:8000 \
  -v /home/ubuntu/s1/ckpts/s1-20250213_094556:/opt/ml/model \
  --ipc=host deepjavalibrary/djl-serving:0.32.0-pytorch-gpu
I use the openai client library to run inference against the model with the following code:
import openai

client = openai.OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="None")
response = client.chat.completions.create(
    model="model",
    messages=[
        {"role": "user", "content": "how much is 2+2?"},
    ],
    temperature=0.6,
    max_tokens=4096,
)
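For reference, the same request can be issued without the openai client. This is just a sketch using requests against the OpenAI-compatible chat completions route that the client above resolves to (adjust host/port to your setup); it should surface the same 424 response, which suggests the failure is server-side rather than in the client:

import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "model": "model",
        "messages": [{"role": "user", "content": "how much is 2+2?"}],
        "temperature": 0.6,
        "max_tokens": 4096,
    },
)
# Expect 424 with the "invoke handler failure" body shown below
print(resp.status_code, resp.json())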
Expected Behavior
I get an answer from the model
Error Message
In the client:
APIStatusError: Error code: 424 - {'code': 424, 'message': 'invoke handler failure', 'error': "The following `model_kwargs` are not used by the model: ['frequency_penalty', 'presence_penalty', 'ignore_eos'] (note: typos in the generate arguments will also show up in this list)"}
In the container logs:
INFO PyProcess W-123-model-stdout: ERROR::Failed invoke service.invoke_handler()
INFO PyProcess W-123-model-stdout: Traceback (most recent call last):
INFO PyProcess W-123-model-stdout: File "/tmp/.djl.ai/python/0.32.0-SNAPSHOT/djl_python_engine.py", line 161, in run_server
INFO PyProcess W-123-model-stdout: outputs = self.service.invoke_handler(function_name, inputs)
INFO PyProcess W-123-model-stdout: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO PyProcess W-123-model-stdout: File "/tmp/.djl.ai/python/0.32.0-SNAPSHOT/djl_python/service_loader.py", line 30, in invoke_handler
INFO PyProcess W-123-model-stdout: return getattr(self.module, function_name)(inputs)
INFO PyProcess W-123-model-stdout: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO PyProcess W-123-model-stdout: File "/tmp/.djl.ai/python/0.32.0-SNAPSHOT/djl_python/huggingface.py", line 660, in handle
INFO PyProcess W-123-model-stdout: return _service.inference(inputs)
INFO PyProcess W-123-model-stdout: ^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO PyProcess W-123-model-stdout: File "/tmp/.djl.ai/python/0.32.0-SNAPSHOT/djl_python/huggingface.py", line 242, in inference
INFO PyProcess W-123-model-stdout: return self._dynamic_batch_inference(parsed_input.batch, errors,
INFO PyProcess W-123-model-stdout: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO PyProcess W-123-model-stdout: File "/tmp/.djl.ai/python/0.32.0-SNAPSHOT/djl_python/huggingface.py", line 257, in _dynamic_batch_inference
INFO PyProcess W-123-model-stdout: prediction = self.hf_pipeline(input_data, **parameters)
INFO PyProcess W-123-model-stdout: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO PyProcess W-123-model-stdout: File "/tmp/.djl.ai/python/0.32.0-SNAPSHOT/djl_python/huggingface.py", line 458, in wrapped_pipeline
INFO PyProcess W-123-model-stdout: output_tokens = model.generate(
INFO PyProcess W-123-model-stdout: ^^^^^^^^^^^^^^^
INFO PyProcess W-123-model-stdout: File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
INFO PyProcess W-123-model-stdout: return func(*args, **kwargs)
INFO PyProcess W-123-model-stdout: ^^^^^^^^^^^^^^^^^^^^^
INFO PyProcess W-123-model-stdout: File "/usr/local/lib/python3.11/dist-packages/transformers/generation/utils.py", line 2012, in generate
INFO PyProcess W-123-model-stdout: self._validate_model_kwargs(model_kwargs.copy())
INFO PyProcess W-123-model-stdout: File "/usr/local/lib/python3.11/dist-packages/transformers/generation/utils.py", line 1388, in _validate_model_kwargs
INFO PyProcess W-123-model-stdout: raise ValueError(
INFO PyProcess W-123-model-stdout: ValueError: The following `model_kwargs` are not used by the model: ['frequency_penalty', 'presence_penalty', 'ignore_eos'] (note: typos in the generate arguments will also show up in this list)
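The ValueError comes from transformers' own kwarg validation: GenerationMixin.generate() rejects any keyword it does not recognize, and frequency_penalty, presence_penalty, and ignore_eos are OpenAI/vLLM-style sampling parameters with no transformers equivalent (transformers exposes repetition_penalty instead). A minimal sketch outside DJL, using a public Qwen2.5 checkpoint as a stand-in for my model, reproduces the same error:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-1.5B-Instruct"  # stand-in for my checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tok("how much is 2+2?", return_tensors="pt")
# Raises: ValueError: The following `model_kwargs` are not used by the model:
# ['frequency_penalty', 'presence_penalty', 'ignore_eos'] ...
model.generate(**inputs, frequency_penalty=0.0, presence_penalty=0.0, ignore_eos=False)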
How to Reproduce?
The code above can be used to reproduce the issue. You can use a 1.5B Qwen2.5 Instruct model from Hugging Face (e.g., Qwen/Qwen2.5-1.5B-Instruct).
Steps to reproduce
- download the model (see the sketch after this list)
- host the model via the DJL container
- use the openai client to run an inference
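A sketch for the download step, assuming the public checkpoint above stands in for my fine-tuned one:

from huggingface_hub import snapshot_download

# Download the checkpoint, then mount local_dir into the container at /opt/ml/model
snapshot_download(
    repo_id="Qwen/Qwen2.5-1.5B-Instruct",
    local_dir="/home/ubuntu/model",
)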
What have you tried to solve it?
Environment Info
N/A; DJL runs inside the official container on a g6e.12xlarge EC2 instance.