djl-serving
DeepSpeed streaming, max_length is ignored
serving.properties:
option.model_id=EleutherAI/gpt-neo-1.3B
option.task=text-generation
option.tensor_parallel_degree=2
option.dtype=fp16
option.enable_streaming=true
#option.enable_streaming=huggingface
engine=DeepSpeed
option.parallel_loading=true
curl command:
curl -X POST "http://localhost:8080/invocations" \
-H "content-type: application/json" \
-d '{"inputs": ["Large language model is"], "parameters": {"max_length": 2}}'
Expected 2 new tokens to be returned, but 50 tokens are returned.
This is not a valid input, since max_length is smaller than the input token count. You may want to use a value larger than the input token length, or use max_new_tokens
instead to avoid the input-token-size limitation.
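To illustrate the distinction the maintainer is drawing, here is a minimal sketch of the Hugging Face-style generation length semantics (an illustration only, not djl-serving's actual implementation; the helper name and the assumed prompt token count are hypothetical): max_length caps the total sequence including the prompt, while max_new_tokens caps only the newly generated tokens.

```python
from typing import Optional

def new_token_budget(input_len: int,
                     max_length: Optional[int] = None,
                     max_new_tokens: Optional[int] = None) -> int:
    """Return how many new tokens generation may produce (hypothetical helper)."""
    if max_new_tokens is not None:
        # max_new_tokens is independent of the prompt length.
        return max_new_tokens
    if max_length is not None:
        # max_length counts the prompt tokens toward the cap, so a value
        # smaller than the prompt length leaves no budget for new tokens.
        return max(0, max_length - input_len)
    return 0

# Assume the prompt "Large language model is" tokenizes to 5 tokens.
prompt_len = 5
print(new_token_budget(prompt_len, max_length=2))      # 0: cap already consumed by the prompt
print(new_token_budget(prompt_len, max_length=25))     # 20
print(new_token_budget(prompt_len, max_new_tokens=2))  # 2
```

Under these semantics, sending `"parameters": {"max_new_tokens": 2}` in the curl request should yield exactly 2 new tokens regardless of prompt length, which is what the maintainer recommends above.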
Tried max_length = 25; still the same result. We need to standardize the parameters as much as possible.