lmdeploy

[Feature] Some suggestions for api_server

Open ly19970621 opened this issue 1 year ago • 4 comments

Motivation

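I started an inference service for InternVL2-26B with lmdeploy serve api_server and ran into the following issues:

1. With stream=true, every chunk contains "usage": null, and even the chunk for the last generated token carries no actual input/output token counts in "usage". This makes it impossible to count tokens for the streaming interface. As I understand it, this is not the standard OpenAI format; could the maintainers please look into fixing it?
2. When load-testing the endpoint with gradually increasing concurrency, GPU memory usage keeps growing, and the extra memory is not released even when no requests are being sent. I suggest adding a step that frees GPU memory after each request completes.
3. After the service is deployed successfully, host memory does not seem to be released: free -m shows a large amount of memory under buff/cache. Could the maintainers please take a look?

Snipaste_2024-07-26_15-11-05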

Related resources

No response

Additional context

No response

ly19970621 avatar Jul 26 '24 07:07 ly19970621

  1. OpenAI handles usage for streaming through stream_options: {"include_usage": true}, so handling for this parameter should be implemented.

akai-shuuichi avatar Jul 26 '24 08:07 akai-shuuichi

@AllentDan may check stream_options: {"include_usage": true}

lvhan028 avatar Aug 15 '24 04:08 lvhan028

@AllentDan may check stream_options: {"include_usage": true}

We have to return the usage in each streaming response. include_usage is different.

If set, an additional chunk will be streamed before the data: [DONE] message. The usage field on this chunk shows the token usage statistics for the entire request, and the choices field will always be an empty array. All other chunks will also include a usage field, but with a null value.
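To make the quoted behavior concrete, here is a minimal client-side sketch of how the token counts could be recovered from such a stream. It assumes the server emits OpenAI-style server-sent events (`data: {...}` lines ending with `data: [DONE]`). The function name `extract_stream_usage` and the sample lines are hypothetical and are not taken from lmdeploy or any OpenAI client.

```python
import json
from typing import Iterable, Optional


def extract_stream_usage(sse_lines: Iterable[str]) -> Optional[dict]:
    """Return the last non-null `usage` object seen in an SSE chat stream.

    Hypothetical helper: it only assumes OpenAI-style `data: ...` lines.
    """
    usage = None
    for raw in sse_lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        if chunk.get("usage") is not None:
            usage = chunk["usage"]  # keep the most recent non-null usage
    return usage


# Made-up sample stream shaped like the behavior quoted above:
# content chunks carry "usage": null, the extra chunk has empty "choices".
sample = [
    'data: {"choices": [{"delta": {"content": "Hi"}}], "usage": null}',
    "",
    'data: {"choices": [], "usage": {"prompt_tokens": 5, '
    '"completion_tokens": 1, "total_tokens": 6}}',
    "data: [DONE]",
]

print(extract_stream_usage(sample))
```

Because the helper keeps the last non-null `usage` it sees, it should work both when the counts arrive in a separate final chunk and when they are attached to the last content chunk. With the `requests` library, the lines could come from `response.iter_lines(decode_unicode=True)` on a streamed response.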

AllentDan avatar Aug 15 '24 04:08 AllentDan

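1. With stream=true, every chunk contains "usage": null, and even the chunk for the last generated token carries no actual input/output token counts in "usage". This makes it impossible to count tokens for the streaming interface. As I understand it, this is not the standard OpenAI format; could the maintainers please look into fixing it?

The last token does report the input and output token counts. During streaming, this field is null. This is the standard OpenAI format. If the last token does not include the counts, please provide steps to reproduce.

2. When load-testing the endpoint with gradually increasing concurrency, GPU memory usage keeps growing, and the extra memory is not released even when no requests are being sent. I suggest adding a step that frees GPU memory after each request completes.

Once GPU memory has been allocated, it is not released. Frequently releasing and reallocating it would degrade performance.

3. After the service is deployed successfully, host memory does not seem to be released: free -m shows a large amount of memory under buff/cache. Could the maintainers please take a look?

This may be caused by transformers.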
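As a side note on the third point, the buff/cache column of free is derived from /proc/meminfo, and on Linux this memory is generally reclaimable by the kernel, so MemAvailable is usually the more telling number. The sketch below shows one way to compute both values. The helper names and the sample figures are hypothetical, and the formula (Buffers + Cached + SReclaimable) assumes the procps version of free.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: integer value}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or not fields:
            continue  # skip lines that are not "Name: value [unit]"
        info[name.strip()] = int(fields[0])
    return info


def buff_cache_mib(info: dict) -> float:
    """Approximate the buff/cache column of `free -m` (assumes procps rules)."""
    kib = info.get("Buffers", 0) + info.get("Cached", 0) + info.get("SReclaimable", 0)
    return kib / 1024


# Made-up sample values for illustration only, not measurements from the issue.
sample_meminfo = """\
MemTotal:        1048576 kB
MemFree:           51200 kB
MemAvailable:     262144 kB
Buffers:            1024 kB
Cached:           204800 kB
SReclaimable:       4096 kB
"""

info = parse_meminfo(sample_meminfo)
print("buff/cache MiB:", buff_cache_mib(info))
print("available MiB:", info["MemAvailable"] / 1024)
```

On a real Linux host the same functions can be fed the contents of /proc/meminfo. If MemAvailable stays high while buff/cache is large, the cached memory is most likely file data that the kernel can drop under pressure rather than memory held by the service.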

lvhan028 avatar Aug 15 '24 05:08 lvhan028

This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 5 days if the stale label is not removed or if there is no further response.

github-actions[bot] avatar Aug 23 '24 02:08 github-actions[bot]

This issue is closed because it has been stale for 5 days. Please open a new issue if you have similar issues or you have any new updates now.

github-actions[bot] avatar Aug 28 '24 02:08 github-actions[bot]