fastllm icon indicating copy to clipboard operation
fastllm copied to clipboard

求助求助,llama模型生成第一个token时,有3个代码块耗时严重【已解决2个,仅剩1个🙏🙏🙏】

Open AnShengqiang opened this issue 1 year ago • 10 comments

前言 感谢开发者,研发出了如此易理解、好部署、配件完善的加速库🎉🎉🎉,真的很棒,对我很有帮助😊😊😊!!!

问题描述 有大量业务场景,仅需要模型生成单个token,如:新闻分类、逻辑推断、情感分析、关系提取、语种检测...。 在此类场景下,fastllm库中的llama模型实现(其它模型可能也存在)存在一个严重问题:随着batch size增大,耗时线性增长😰。 这个问题其他用户也复现了,见issue:ISSUE 337

复现细节

  • 硬件:显卡4090,内存cpu管够。
  • 模型:LlamaModel
  • 接口:batch_response
import pyfastllm
model_path = "tokenizer path"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code = True)

flm_model = pyfastllm.LlamaModel()
flm_model.load_weights("xxx_fp16.flm")

prompt = """..."""  # around 300 words

config = pyfastllm.GenerationConfig()
config.max_length = 1

batch_size = 16 # batch size=16时间是8的2倍(其它数值均尝试过,基本呈线性增长)
r = flm_model.batch_response([prompt]*batch_size, None, config)

问题分析 经过大量的耗时分析工作(调试、日志打印、分析工具等),我将耗时定位到了3个地方。 文件:llama.cpp 方法:std::vector LlamaModel::ForwardBatch(int batch, const fastllm::Data &inputIds, const fastllm::Data &attentionMask, const fastllm::Data &positionIds, std::vector<std::pair<Data, Data>> &pastKeyValues, const GenerationConfig &generationConfig, const LastTokensManager &lastTokens, std::vector <std::vector *> *retLogits) 耗时点1: PermuteSelf(q, {0, 2, 1, 3}); PermuteSelf(k, {0, 2, 1, 3}); PermuteSelf(v, {0, 2, 1, 3});

耗时点2: CatDirect(pastKey, k, 1); CatDirect(pastValue, v, 1);

耗时点3: logits.ToDevice(DataDevice::CPU);

这几点不起眼的操作,在第一个token输出时,耗时远大于😨真正的前向计算操作。

最后 由于我才疏学浅😞,难以解决此问题,因此请求帮助,希望开发者大大们协助解决此问题,谢谢♥

希望这个项目越来越好,star✨过万!

AnShengqiang avatar Oct 17 '23 10:10 AnShengqiang

感谢你的测试和反馈。

@ztxz16 大佬,我觉得耗时点3可以优化一下,prefill的时候,对logits,只把最后一个token的部分spli传回CPU做解码,这样可以加速超长prompt的推理,毕竟DeviceToHost很慢。

TylunasLi avatar Oct 17 '23 16:10 TylunasLi

@AnShengqiang llama.cpp代码比较老,里面算完logits后直接传回CPU了,可以参考chatglm.cpp里改一下,先topk再回CPU

siemonchan avatar Oct 18 '23 03:10 siemonchan

G(1K(E%UP0JZ8OZSTINBU$3

依次打印了各部分耗时: 总共1.51s耗时:3个操作加起来耗费 1.34s(PermuteSelf耗费0.431s,CatDirect耗费0.457s,ToDevice耗费0.455s)🫡

AnShengqiang avatar Oct 18 '23 03:10 AnShengqiang

@AnShengqiang llama.cpp代码比较老,里面算完logits后直接传回CPU了,可以参考chatglm.cpp里改一下,先topk再回CPU

这个可以的,前两个操作有希望规避不。这两个操作,也是随着batch size增大,耗时线性增长,能规避最好了。

AnShengqiang avatar Oct 18 '23 03:10 AnShengqiang

你是怎么统计时间的?cuda 同步之前测试的时间不对,使用nv专用的性能测试工具。

wildkid1024 avatar Oct 18 '23 04:10 wildkid1024

你是怎么统计时间的?cuda 同步之前测试的时间不对,使用nv专用的性能测试工具。

这个时间是直接用chrono算时间,cout打印出来的(没能找到耗时算子,出此下策)。示例代码:

auto start_time = std::chrono::high_resolution_clock::now();
logits.ToDevice(DataDevice::CPU);
auto end_time = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);
std::cout << "ToDevice操作耗时: " << duration.count() << " 微秒" << std::endl;

我没能通过nsight找到耗时算子(如果有人熟悉这个,就太好啦),截图: image

AnShengqiang avatar Oct 18 '23 09:10 AnShengqiang

最新进展: #351 已经解决了后两个耗时的问题。速度提升巨大🎉🎉🎉感谢 @ztxz16 与 @siemonchan 的贡献 🫡

目前仅剩余第一个耗时问题。

AnShengqiang avatar Oct 19 '23 06:10 AnShengqiang

llama.cpp的ForwardBatch方法中 PermuteSelf(q, {0, 2, 1, 3}); PermuteSelf(k, {0, 2, 1, 3}); PermuteSelf(v, {0, 2, 1, 3}); 耗时依然很严重,占比很高(在首个token生成时),能否继续想办法优化或规避? @ztxz16 @siemonchan 🙏🙏🙏

AnShengqiang avatar Oct 22 '23 13:10 AnShengqiang

当前问题:

  1. 没有通过nvidia分析工具nsight找到耗时点。
  2. 通过土法(无奈之举,大佬见谅,明白cuda 同步之前测试的时间可能不对)chrono+cout打印耗时点,发现llama.cpp 299~301行(PermuteSelf代码块)耗时占比较多。

耗时占比**超过80%**😨(config.max_length = 1,输入prompt较长,batch_size=16) 详细见复现代码章节。

复现代码:

python:

...(省略模型加载,warm up等)

config = pyfastllm.GenerationConfig()
config.max_length = 1
prompt = """需求分析
新版本的fastllm中添加了ForwardBatch的功能,用于处理批量推理请求,单次推理请求会被视为batch为1的批量请求,这样做似乎没什么问题。
然而在具体实践中,用户的请求往往是一个一个来的,每来一个请求都要等上一个请求完成之后才能开始推理下一个请求,一旦并发数起来,体验将及其糟糕。

所幸,stream流式输出能够在一定程度上缓解这个问题,web前端调用是一个异步线程队列,那么多个用户间的web 前端IO的时间差恰好可以给推理留出一定时间,虽然后台依然是一个token一个token地进行推理,但前端却看起来能够多用户并发使用。

但不幸的是,这种方法只在一定程度上解决问题,当用户量变多时,后台由于大量堆积的待推理的tokens,用户的体验又将变得十分糟糕。

在fastllm中,使用的是动态组batch的方案,即当A请求正在运行的token和B请求正在运行的token进行组合,合并为一个batch,在模型中并行推理,以提高模型实际运行时的"""
r = flm_model.batch_response([prompt]*16, None, config)
print(r)

输出结果:(共耗时847毫秒,PermuteSelf耗时678毫秒,占比超过80%)

['推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理']
CPU times: user 738 ms, sys: 113 ms, total: 851 ms
Wall time: 847 ms
PermuteSelf 总耗时: 678021微秒

如何统计的PermuteSelf耗时?

...
    auto total_duration_permute = std::chrono::microseconds(0);
    
    for (int i = 0; i < block_cnt; i++) {
        ...
        auto s_pre = std::chrono::high_resolution_clock::now();
        // todo 速度嫌疑点,llama.cpp 299~301行
        PermuteSelf(q, {0, 2, 1, 3});
        PermuteSelf(k, {0, 2, 1, 3});
        PermuteSelf(v, {0, 2, 1, 3});
    
        auto s_aft = std::chrono::high_resolution_clock::now();
        total_duration_permute += std::chrono::duration_cast<std::chrono::microseconds>(s_aft - s_pre);
        ...
    }
    
    std::cout << "PermuteSelf 总耗时: " << total_duration_permute.count() << "微秒" << std::endl;
...  

补充打印:

在PermuteSelf方法中,打印耗时,耗时非常规律,4短1长🚩:

void PermuteSelf(const Data &input, const std::vector<int> &axis) {
        Data axisData = Data(DataType::INT32PARAM, {(int)axis.size()});
        axisData.Allocate();
        for (int i = 0; i < axisData.Count(0); i++) {
            ((int32_t*)axisData.cpuData)[i] = axis[i];
        }
        auto s1 = std::chrono::high_resolution_clock::now();
        curExecutor->Run("PermuteSelf", {
                {"input", (Data*)&input}, {"axis", &axisData}
        }, {}, {});
        auto s2 = std::chrono::high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(s2 - s1).count();

        std::cout << "Time taken for PermuteSelf stage-2: " << duration << " microseconds" << std::endl;
}

输出如下,调用次数规律很明显,耗时基本4短1长(除第一个以外): 耗时相加和Permute总耗时基本一致。

['推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理', '推理']
CPU times: user 714 ms, sys: 107 ms, total: 821 ms
Wall time: 818 ms
Time taken for PermuteSelf stage-2: 426 microseconds
Time taken for PermuteSelf stage-2: 298 microseconds
Time taken for PermuteSelf stage-2: 306 microseconds
Time taken for PermuteSelf stage-2: 77 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 18877 microseconds
Time taken for PermuteSelf stage-2: 313 microseconds
Time taken for PermuteSelf stage-2: 316 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 20658 microseconds
Time taken for PermuteSelf stage-2: 325 microseconds
Time taken for PermuteSelf stage-2: 324 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21236 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21184 microseconds
Time taken for PermuteSelf stage-2: 324 microseconds
Time taken for PermuteSelf stage-2: 325 microseconds
Time taken for PermuteSelf stage-2: 4 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21289 microseconds
Time taken for PermuteSelf stage-2: 324 microseconds
Time taken for PermuteSelf stage-2: 325 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21284 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 324 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 21272 microseconds
Time taken for PermuteSelf stage-2: 324 microseconds
Time taken for PermuteSelf stage-2: 324 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 21229 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21146 microseconds
Time taken for PermuteSelf stage-2: 321 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21125 microseconds
Time taken for PermuteSelf stage-2: 320 microseconds
Time taken for PermuteSelf stage-2: 321 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21053 microseconds
Time taken for PermuteSelf stage-2: 319 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 4 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 21064 microseconds
Time taken for PermuteSelf stage-2: 321 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 4 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 21139 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21211 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21151 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 21154 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21157 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21238 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21154 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21138 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21191 microseconds
Time taken for PermuteSelf stage-2: 324 microseconds
Time taken for PermuteSelf stage-2: 326 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21359 microseconds
Time taken for PermuteSelf stage-2: 324 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21277 microseconds
Time taken for PermuteSelf stage-2: 324 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21224 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21153 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21247 microseconds
Time taken for PermuteSelf stage-2: 324 microseconds
Time taken for PermuteSelf stage-2: 324 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21271 microseconds
Time taken for PermuteSelf stage-2: 324 microseconds
Time taken for PermuteSelf stage-2: 324 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21260 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21154 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21135 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 322 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
Time taken for PermuteSelf stage-2: 21171 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 323 microseconds
Time taken for PermuteSelf stage-2: 3 microseconds
Time taken for PermuteSelf stage-2: 2 microseconds
PermuteSelf 总耗时: 675325微秒

AnShengqiang avatar Oct 25 '23 15:10 AnShengqiang

@AnShengqiang 你这个耗时问题是稳定复现的?

yiguanxian avatar Jan 19 '24 18:01 yiguanxian