Default inference values from generation_config.json are not being applied
Description
We are using the DJL container 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124 with vLLM as the inference engine to serve Llama 3.1 - Llama 3.3 models. The model files include a "generation_config.json" file, which can specify default values for the sampling parameters temperature, top_p, and top_k. These defaults are not being applied to inference requests. Could this be implemented?
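For example, the file might look like the following (illustrative values only, similar to the defaults the Llama 3.1 Instruct models ship):

```json
{
  "do_sample": true,
  "temperature": 0.6,
  "top_p": 0.9
}
```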
We would like to populate the "generation_config.json" file with the values that perform best for the model. Currently, DJL appears to ignore this file and instead uses the defaults from https://github.com/deepjavalibrary/djl-serving/blob/7315729019480b004784b3f38c474509e2953e0e/engines/python/setup/djl_python/seq_scheduler/search_config.py#L19
Thank you.
Thanks for reporting this issue. It looks like we will need to pass the generation_config.json file to the vLLM engine args (https://docs.vllm.ai/en/latest/serving/engine_args.html). I will take a look at this and get back to you with a fix. I expect it to be available in the 0.32.0 release, scheduled for the first week of February.
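For concreteness, here is a rough sketch of the idea against the vLLM Python API directly. The `generation_config` engine argument and the exact values it accepts depend on the vLLM version, so treat this as illustrative rather than as the LMI implementation:

```python
from vllm import LLM

# Assumption: the installed vLLM exposes a `generation_config` engine argument
# and accepts "auto", meaning "load generation_config.json from the model
# directory and use its values as the default sampling parameters".
llm = LLM(
    model="/opt/ml/model",        # model directory containing generation_config.json
    generation_config="auto",
)

# With no per-request sampling parameters supplied, vLLM would fall back to
# the defaults derived from generation_config.json.
outputs = llm.generate(["Hello, my name is"])
print(outputs[0].outputs[0].text)
```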
vLLM added support for this functionality in 0.6.6 https://github.com/vllm-project/vllm/commit/5aef49806da2e6cc8a92c948d44e8a722469135f.
Our most recent container release currently leverages vLLM 0.6.3.post1, which is why this behavior is not observed.
I have raised https://github.com/deepjavalibrary/djl-serving/pull/2685 to address this for the next container release. It is possible that we will also update vLLM by then, but if we do not, this change should still resolve the issue.
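Conceptually, honoring the file amounts to reading generation_config.json and using its values as the defaults when constructing SamplingParams. The sketch below shows that idea only; it is not the actual change in the PR above:

```python
import json
from pathlib import Path

from vllm import SamplingParams


def sampling_params_with_model_defaults(model_dir: str, **request_params) -> SamplingParams:
    """Build SamplingParams whose defaults come from generation_config.json.

    Sketch only: it maps temperature/top_p/top_k and lets per-request values
    override the file's defaults.
    """
    defaults = {}
    config_path = Path(model_dir) / "generation_config.json"
    if config_path.exists():
        gen_cfg = json.loads(config_path.read_text())
        for key in ("temperature", "top_p", "top_k"):
            if key in gen_cfg:
                defaults[key] = gen_cfg[key]
    defaults.update(request_params)  # request-level values win over file defaults
    return SamplingParams(**defaults)


# Example: max_tokens set per request, temperature/top_p/top_k from the file.
params = sampling_params_with_model_defaults("/opt/ml/model", max_tokens=256)
```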
@siddvenk Hello, can you share when the next release is scheduled?
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.