LLaMA-Factory: the model still returns a response when the input exceeds the configured maximum token count


Open luhairong11 opened this issue 8 months ago • 0 comments

Reminder

[X] I have read the README and searched the existing issues.

Reproduction

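python src/api.py --model_name_or_path /data/models/LLM_models/qwen/Qwen-72B-Chat-Int4 --template qwen --infer_backend vllm --vllm_gpu_util 0.9 --vllm_maxlen 8000

The configuration above sets the maximum length to 8000 tokens. When the input exceeds 8000 tokens, a streaming call to the API still returns two JSON chunks with empty content, while vLLM itself logs a warning that the maximum token count was exceeded. Could the code raise an exception in this case, so that the response is easier to interpret?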
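The kind of check being requested could look roughly like the sketch below. It is only an illustration, not LLaMA-Factory's or vLLM's actual code: `PromptTooLongError` and `check_prompt_length` are hypothetical names, and it assumes the prompt has already been tokenized into a list of token IDs before the request reaches the inference backend.

```python
from typing import Sequence


class PromptTooLongError(ValueError):
    """Hypothetical error raised when a prompt exceeds the configured limit."""


def check_prompt_length(prompt_token_ids: Sequence[int], max_len: int) -> int:
    """Return the prompt length, or raise if it exceeds max_len.

    Assumption: max_len plays the role of the --vllm_maxlen value (8000 in
    the report). A stricter check could also reserve room for generated tokens.
    """
    num_tokens = len(prompt_token_ids)
    if num_tokens > max_len:
        raise PromptTooLongError(
            f"Prompt has {num_tokens} tokens, which exceeds the configured "
            f"maximum of {max_len} tokens."
        )
    return num_tokens


if __name__ == "__main__":
    # A short prompt passes the check.
    print(check_prompt_length([1, 2, 3], max_len=8000))

    # An over-long prompt fails with an explicit message instead of
    # silently producing empty streamed chunks.
    try:
        check_prompt_length(list(range(8001)), max_len=8000)
    except PromptTooLongError as err:
        print(f"Rejected: {err}")
```

If a check like this ran before streaming starts, the API layer could turn the exception into an explicit error response rather than emitting empty chunks.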

Expected behavior

No response

System Info

No response

Others

No response

May 29 '24 15:05 luhairong11
