jetson-generative-ai-playground
Add Max Context Len value to work around the context length 1 error
Originally, --max-model-len=1 was set, which caused the "context length 1" error:
"This model's maximum context length is 1 tokens. However, you requested 28 tokens in the messages, Please reduce the length of the messages."
This workaround generates a command like the one below instead of passing --max-model-len=1.
Verified vlm.py with JAO 64GB and JAO 32GB (gemma-3-4b-it), and with Orin NX 16GB (gemma-3-1b-it).
docker run -it --rm \
  --name llm_server \
  --gpus all \
  -p 9000:9000 \
  -e DOCKER_PULL=always --pull always \
  -e HF_TOKEN=${HUGGINGFACE_TOKEN} \
  -e HF_HUB_CACHE=/root/.cache/huggingface \
  -v /mnt/nvme/cache:/root/.cache \
  dustynv/vllm:0.7.4-r36.4.0-cu128-24.04 \
  vllm serve google/gemma-3-4b-it \
    --host=0.0.0.0 --port=9000 --dtype=auto --max-num-seqs=1 --max-model-len=8192 --gpu-memory-utilization=0.75
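
For context, a minimal sketch of how the serve arguments could be assembled with a configurable context length instead of a hard-coded --max-model-len=1. The function and parameter names here are hypothetical illustrations, not the actual vlm.py code:

# Hypothetical sketch: build the 'vllm serve' arguments with an explicit,
# configurable max context length (replacing the hard-coded value of 1).
def build_vllm_serve_args(model, port=9000, max_context_len=8192,
                          gpu_memory_utilization=0.75):
    return [
        "vllm", "serve", model,
        "--host=0.0.0.0",
        f"--port={port}",
        "--dtype=auto",
        "--max-num-seqs=1",
        f"--max-model-len={max_context_len}",  # the workaround: never 1
        f"--gpu-memory-utilization={gpu_memory_utilization}",
    ]

# Example: reproduce the trailing arguments of the command shown above.
print(" ".join(build_vllm_serve_args("google/gemma-3-4b-it")))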