[Question] How can I set max_tokens with tutelgroup/deepseek-671b?
Hi,
I ran the script to deploy the kimi-k2 model using tutelgroup/deepseek-671b. However, API requests with the max_tokens parameter do not seem to work as expected. Is this parameter supported?
max_tokens isn't handled through the REST JSON API yet. Instead, it is currently a static global setting, specified by the --max_seq_len argument (version 20250715 includes a fine-grained max_tokens optimization over CUDA graphs), e.g.:
docker run -it --rm --ipc=host --net=host --shm-size=8g --ulimit memlock=-1 \
--ulimit stack=67108864 --gpus=all -v /:/host -w /host$(pwd) \
tutelgroup/deepseek-671b:a100x8-chat-20250715 \
--try_path ./moonshotai/Kimi-K2-Instruct \
--serve --listen_port 8000 \
--prompt "Calculate the indefinite integral of 1/sin(x) + x" \
--max_seq_len 4200
That said, it is a good feature request, and we'll handle these settings through the REST API in a future release.
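For reference, here is a minimal Python sketch of calling this server; the /chat path and the JSON payload shape are assumptions borrowed from the curl example further down in this thread, and on the 20250715 image any max_tokens field in the request is expected to be ignored in favor of the server-side --max_seq_len:

import requests

# Minimal client sketch for the server started above (assumed /chat path
# and payload shape, taken from the curl example later in this thread).
# On the 20250715 image, generation length is capped by the server-side
# --max_seq_len setting, not by any per-request field.
resp = requests.post(
    "http://0.0.0.0:8000/chat",
    json={"messages": [{"role": "user", "content": "Hello."}]},
    timeout=600,
)
print(resp.text)  # response schema isn't shown in the thread, so print the raw body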
Thank you. Looking forward to your next update!
The new version, tutelgroup/deepseek-671b:mi300x8-chat-20250723, now allows setting max_tokens via the REST API:
curl -X POST http://0.0.0.0:8000/chat -d '{"messages": [{"role": "user", "content": "Hello."}], "max_tokens": 256 }'
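For programmatic use, here is a Python equivalent of the curl call above (a sketch; only the /chat path, the messages payload, and the max_tokens field are confirmed by the thread, and the response schema is not documented):

import requests

# Same request as the curl example: on the mi300x8-chat-20250723 image,
# max_tokens is honored per request rather than fixed by --max_seq_len.
resp = requests.post(
    "http://0.0.0.0:8000/chat",
    json={
        "messages": [{"role": "user", "content": "Hello."}],
        "max_tokens": 256,  # per-request generation cap (supported since 20250723)
    },
    timeout=600,
)
print(resp.text)  # raw response body; its schema isn't documented in the thread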