fastllm

Published 20 hours ago •


batch_response() latency scales linearly with the length of the prompt list

Open Lzhang-hub opened this issue 2 years ago • 9 comments

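import time

# `pyfastllm`, `model` (an already-loaded model) and `text` (one prompt string)
# are assumed to be defined earlier in the session.

# Time a batch that contains a single prompt.
st = time.time()
prompts = [text]
config = pyfastllm.GenerationConfig()
res = model.batch_response(prompts, None, config)
one_time = time.time() - st
print(one_time)

# Time a batch that contains the same prompt four times.
multi_st = time.time()
prompts = [text, text, text, text]
config = pyfastllm.GenerationConfig()
res = model.batch_response(prompts, None, config)
multi_time = time.time() - multi_st
print(multi_time)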

multi_time is roughly four times one_time. Could this be caused by an unsuitable parameter setting?

Sep 27 '23 06:09 Lzhang-hub
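For anyone who wants to reproduce the measurement, here is a minimal sketch of a slightly more robust harness: it repeats each run, keeps the best time, and reports the slowdown relative to the smallest batch. The helper names `time_batch` and `scaling_table` are hypothetical and not part of fastllm; `run_batch` is any callable that accepts a list of prompts.

```python
import time


def time_batch(run_batch, prompts, repeats=3, clock=time.perf_counter):
    """Return the best-of-`repeats` wall time for one run_batch(prompts) call."""
    best = float("inf")
    for _ in range(repeats):
        start = clock()
        run_batch(prompts)
        best = min(best, clock() - start)
    return best


def scaling_table(run_batch, prompt, sizes=(1, 2, 4), repeats=3,
                  clock=time.perf_counter):
    """Map each batch size to (seconds, slowdown relative to the first size)."""
    results = {}
    base = None
    for n in sizes:
        elapsed = time_batch(run_batch, [prompt] * n, repeats, clock)
        if base is None:
            base = elapsed  # the first size is the reference point
        results[n] = (elapsed, elapsed / base)
    return results
```

With the objects from the snippet above, it could be called as `scaling_table(lambda p: model.batch_response(p, None, config), text)`. A slowdown close to the batch size means no batching benefit; a slowdown close to 1 means the batch is nearly free.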

I looked at the source code: Forwardbatch seems to call Forward in a for loop? If so, doesn't that mean there is no speedup at all?

Sep 27 '23 07:09 Lzhang-hub

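> I looked at the source code: Forwardbatch seems to call Forward in a for loop? If so, doesn't that mean there is no speedup at all?

That is the base-class function. In the concrete model implementations (chatglm.cpp, llama.cpp, and so on), the batch is not run one item at a time.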

Sep 28 '23 07:09 ztxz16

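@ztxz16 Oh, I see it now. But what would cause the latency to grow linearly? With the chatglm2 model, testing directly against the /api/batch_chat endpoint in the demo's web_api also shows latency that is linear in the list length.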

Sep 28 '23 07:09 Lzhang-hub

This way of testing measures the prefill stage; I suggest testing the decode stage.

Sep 29 '23 03:09 wildkid1024
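One possible reading of this suggestion, sketched as a toy roofline-style estimate: prefill processes every prompt token at once and tends to be compute-bound, so its cost grows with batch size, while decode emits one token per sequence per step and tends to be bound by reading the weights, which happens once per step regardless of batch size. Every number below (parameter count, FLOP rate, bandwidth, lengths) is an illustrative assumption, not a measurement of fastllm or of any hardware, and the model ignores attention and KV-cache costs.

```python
def step_time(batch, tokens_per_seq, params=6e9, flops_per_s=1e14,
              bytes_per_s=1e12, bytes_per_param=2):
    """Estimate one forward pass as max(compute time, weight-read time)."""
    # Roughly 2 FLOPs per parameter per processed token.
    compute = 2 * params * batch * tokens_per_seq / flops_per_s
    # The weights are read once per pass, independent of batch size.
    memory = params * bytes_per_param / bytes_per_s
    return max(compute, memory)


def prefill_time(batch, prompt_len=512):
    """One pass over all prompt tokens of every sequence in the batch."""
    return step_time(batch, prompt_len)


def decode_time(batch, new_tokens=128):
    """One pass per generated token, one token per sequence per pass."""
    return new_tokens * step_time(batch, 1)


if __name__ == "__main__":
    for b in (1, 4):
        print(b, prefill_time(b), decode_time(b))
```

Under these assumed numbers the prefill estimate grows about fourfold from batch 1 to batch 4 while the decode estimate stays flat, which is why timing the two stages separately can give very different scaling pictures.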

@Lzhang-hub Hello, I don't see batch_response() defined in llm.py. Could you tell me where this function is?

Oct 08 '23 08:10 iFocusing

> @Lzhang-hub Hello, I don't see batch_response() defined in llm.py. Could you tell me where this function is?

@ztxz16 Could you help explain how to use batch_response()?

Oct 08 '23 09:10 iFocusing

Use pyfastllm.

Oct 08 '23 10:10 wildkid1024

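> This way of testing measures the prefill stage; I suggest testing the decode stage.

@wildkid1024 My understanding is that testing directly through a web API deployment is closer to a real model-serving scenario. We currently have a production requirement for batch inference. If we only test the decode stage, wouldn't that fail to reflect the actual inference scenario?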

Oct 12 '23 12:10 Lzhang-hub

same question

Oct 16 '23 07:10 AnShengqiang