fastllm
fastllm copied to clipboard
求助求助,llama模型生成第一个token时,有3个代码块耗时严重【已解决2个,仅剩1个🙏🙏🙏】
前言 感谢开发者,研发出了如此易理解、好部署、配件完善的加速库🎉🎉🎉,真的很棒,对我很有帮助😊😊😊!!!
问题描述 有大量业务场景,仅需要模型生成单个token,如:新闻分类、逻辑推断、情感分析、关系提取、语种检测...。 在此类场景下,fastllm库中的llama模型实现(其它模型可能也存在)存在一个严重问题:随着batch size增大,耗时线性增长😰。 这个问题其他用户也复现了,见issue:ISSUE 337
复现细节
- 硬件:显卡4090,内存cpu管够。
- 模型:LlamaModel
- 接口:batch_response
import pyfastllm
model_path = "tokenizer path"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code = True)
flm_model = pyfastllm.LlamaModel()
flm_model.load_weights("xxx_fp16.flm")
prompt = """...""" # around 300 words
config = pyfastllm.GenerationConfig()
config.max_length = 1
batch_size = 16 # batch size=16时间是8的2倍(其它数值均尝试过,基本呈线性增长)
r = flm_model.batch_response([prompt]*batch_size, None, config)
问题分析
经过大量的耗时分析工作(调试、日志打印、分析工具等),我将耗时定位到了3个地方。
文件:llama.cpp
方法:std::vector
耗时点2: CatDirect(pastKey, k, 1); CatDirect(pastValue, v, 1);
耗时点3: logits.ToDevice(DataDevice::CPU);
这几点不起眼的操作,在第一个token输出时,耗时远大于😨真正的前向计算操作。
最后 由于我才疏学浅😞,难以解决此问题,因此请求帮助,希望开发者大大们协助解决此问题,谢谢♥
希望这个项目越来越好,star✨过万!
感谢你的测试和反馈。
@ztxz16 大佬,我觉得耗时点3可以优化一下,prefill的时候,对logits,只把最后一个token的部分spli传回CPU做解码,这样可以加速超长prompt的推理,毕竟DeviceToHost很慢。
@AnShengqiang llama.cpp代码比较老,里面算完logits后直接传回CPU了,可以参考chatglm.cpp里改一下,先topk再回CPU
依次打印了各部分耗时: 总共1.51s耗时:3个操作加起来耗费 1.34s(PermuteSelf耗费0.431s,CatDirect耗费0.457s,ToDevice耗费0.455s)🫡
@AnShengqiang llama.cpp代码比较老,里面算完logits后直接传回CPU了,可以参考chatglm.cpp里改一下,先topk再回CPU
这个可以的,前两个操作有希望规避不。这两个操作,也是随着batch size增大,耗时线性增长,能规避最好了。
你是怎么统计时间的?cuda 同步之前测试的时间不对,使用nv专用的性能测试工具。
你是怎么统计时间的?cuda 同步之前测试的时间不对,使用nv专用的性能测试工具。
这个时间是直接用chrono算时间,cout打印出来的(没能找到耗时算子,出此下策)。示例代码:
auto start_time = std::chrono::high_resolution_clock::now();
logits.ToDevice(DataDevice::CPU);
auto end_time = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);
std::cout << "ToDevice操作耗时: " << duration.count() << " 微秒" << std::endl;
我没能通过nsight找到耗时算子(如果有人熟悉这个,就太好啦),截图:
最新进展: #351 已经解决了后两个耗时的问题。速度提升巨大🎉🎉🎉感谢 @ztxz16 与 @siemonchan 的贡献 🫡
目前仅剩余第一个耗时问题。
llama.cpp的ForwardBatch方法中 PermuteSelf(q, {0, 2, 1, 3}); PermuteSelf(k, {0, 2, 1, 3}); PermuteSelf(v, {0, 2, 1, 3}); 耗时依然很严重,占比很高(在首个token生成时),能否继续想办法优化或规避? @ztxz16 @siemonchan 🙏🙏🙏
当前问题:
- 没有通过nvidia分析工具nsight找到耗时点。
- 通过土法(无奈之举,大佬见谅,明白cuda 同步之前测试的时间可能不对)chrono+cout打印耗时点,发现llama.cpp 299~301行(PermuteSelf代码块)耗时占比较多。
耗时占比**超过80%**😨(config.max_length = 1,输入prompt较长,batch_size=16) 详细见复现代码章节。
复现代码:
python:
...(省略模型加载,warm up等)
config = pyfastllm.GenerationConfig()
config.max_length = 1
prompt = """需求分析
新版本的fastllm中添加了ForwardBatch的功能,用于处理批量推理请求,单次推理请求会被视为batch为1的批量请求,这样做似乎没什么问题。
然而在具体实践中,用户的请求往往是一个一个来的,每来一个请求都要等上一个请求完成之后才能开始推理下一个请求,一旦并发数起来,体验将及其糟糕。
所幸,stream流式输出能够在一定程度上缓解这个问题,web前端调用是一个异步线程队列,那么多个用户间的web 前端IO的时间差恰好可以给推理留出一定时间,虽然后台依然是一个token一个token地进行推理,但前端却看起来能够多用户并发使用。
但不幸的是,这种方法只在一定程度上解决问题,当用户量变多时,后台由于大量堆积的待推理的tokens,用户的体验又将变得十分糟糕。
在fastllm中,使用的是动态组batch的方案,即当A请求正在运行的token和B请求正在运行的token进行组合,合并为一个batch,在模型中并行推理,以提高模型实际运行时的"""
r = flm_model.batch_response([prompt]*16, None, config)
print(r)
输出结果:(共耗时847毫秒,PermuteSelf耗时678毫秒,占比超过80%)
['推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理']
CPU times: user 738 ms, sys: 113 ms, total: 851 ms
Wall time: 847 ms
PermuteSelf 总耗时: 678021微秒
如何统计的PermuteSelf耗时?
...
auto total_duration_permute = std::chrono::microseconds(0);
for (int i = 0; i < block_cnt; i++) {
...
auto s_pre = std::chrono::high_resolution_clock::now();
// todo 速度嫌疑点,llama.cpp 299~301行
PermuteSelf(q, {0, 2, 1, 3});
PermuteSelf(k, {0, 2, 1, 3});
PermuteSelf(v, {0, 2, 1, 3});
auto s_aft = std::chrono::high_resolution_clock::now();
total_duration_permute += std::chrono::duration_cast<std::chrono::microseconds>(s_aft - s_pre);
...
}
std::cout << "PermuteSelf 总耗时: " << total_duration_permute.count() << "微秒" << std::endl;
...
补充打印:
在PermuteSelf方法中,打印耗时,耗时非常规律,4短1长🚩:
void PermuteSelf(const Data &input, const std::vector<int> &axis) {
Data axisData = Data(DataType::INT32PARAM, {(int)axis.size()});
axisData.Allocate();
for (int i = 0; i < axisData.Count(0); i++) {
((int32_t*)axisData.cpuData)[i] = axis[i];
}
auto s1 = std::chrono::high_resolution_clock::now();
curExecutor->Run("PermuteSelf", {
{"input", (Data*)&input}, {"axis", &axisData}
}, {}, {});
auto s2 = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(s2 - s1).count();
std::cout << "Time taken for PermuteSelf stage-2: " << duration << " microseconds" << std::endl;
}
输出如下,调用次数规律很明显,耗时基本4短1长(除第一个以外): 耗时相加和Permute总耗时基本一致。
['推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理']
CPU times: user 714 ms, sys: 107 ms, total: 821 ms
Wall time: 818 ms
Time taken for PermuteSelf stage-2: 426 microseconds
Time taken for PermuteSelf stage-2: 298 microseconds
Time taken for PermuteSelf stage-2: 306 microseconds
Time taken for PermuteSelf stage-2: 77 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 18877 microseconds
Time taken for PermuteSelf stage-2: 313 microseconds
Time taken for PermuteSelf stage-2: 316 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 20658 microseconds
Time taken for PermuteSelf stage-2: 325 microseconds
Time taken for PermuteSelf stage-2: 324 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21236 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21184 microseconds
Time taken for PermuteSelf stage-2: 324 microseconds
Time taken for PermuteSelf stage-2: 325 microseconds
Time taken for PermuteSelf stage-2: 4 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21289 microseconds
Time taken for PermuteSelf stage-2: 324 microseconds
Time taken for PermuteSelf stage-2: 325 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21284 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 324 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 21272 microseconds
Time taken for PermuteSelf stage-2: 324 microseconds
Time taken for PermuteSelf stage-2: 324 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 21229 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21146 microseconds
Time taken for PermuteSelf stage-2: 321 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21125 microseconds
Time taken for PermuteSelf stage-2: 320 microseconds
Time taken for PermuteSelf stage-2: 321 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21053 microseconds
Time taken for PermuteSelf stage-2: 319 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 4 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 21064 microseconds
Time taken for PermuteSelf stage-2: 321 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 4 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 21139 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21211 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21151 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 21154 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21157 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21238 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21154 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21138 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21191 microseconds
Time taken for PermuteSelf stage-2: 324 microseconds
Time taken for PermuteSelf stage-2: 326 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21359 microseconds
Time taken for PermuteSelf stage-2: 324 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21277 microseconds
Time taken for PermuteSelf stage-2: 324 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21224 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21153 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21247 microseconds
Time taken for PermuteSelf stage-2: 324 microseconds
Time taken for PermuteSelf stage-2: 324 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21271 microseconds
Time taken for PermuteSelf stage-2: 324 microseconds
Time taken for PermuteSelf stage-2: 324 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21260 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21154 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21135 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21171 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
PermuteSelf 总耗时: 675325微秒
@AnShengqiang 你这个耗时问题是稳定复现的?