Repeated model loading with bge-reranker-v2-minicpm-layerwise
Every compute_score call performs model loading, which costs a lot of time. How can this part be shortened?
0%| | 0/8 [00:00<?, ?it/s]You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
(the same tokenizer warning is printed once per batch, 8 times in total)
100%|██████████| 8/8 [00:12<00:00, 1.53s/it]
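For context, a minimal sketch of the call pattern in question, assuming FlagEmbedding's LayerWiseFlagLLMReranker and following its documented example (parameter names may vary by version):

```python
from FlagEmbedding import LayerWiseFlagLLMReranker

# Construct the reranker once and keep it alive; the model weights are loaded
# here, not inside compute_score.
reranker = LayerWiseFlagLLMReranker(
    'BAAI/bge-reranker-v2-minicpm-layerwise',
    use_fp16=True,  # fp16 roughly halves memory and speeds up GPU inference
)

# Reuse the same instance for every scoring call.
scores = reranker.compute_score(
    [['what is panda?', 'The giant panda is a bear species endemic to China.']],
    cutoff_layers=[28],  # layerwise model: choose which layer(s) produce scores
)
print(scores)
```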
A DataLoader is used here to load the data, so this setup runs on every call. If you pass all the pairs you want to score in a single call, it will save time.
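In other words, reusing the reranker instance from the sketch above, batch the pairs into one call rather than looping (a hypothetical second pair is used for illustration):

```python
pairs = [
    ['what is panda?', 'The giant panda is a bear species endemic to China.'],
    ['what is panda?', 'The red panda is a small mammal native to the Himalayas.'],
]

# Slow: one compute_score call per pair repeats the DataLoader setup each time.
scores_slow = [reranker.compute_score([pair]) for pair in pairs]

# Faster: pass all pairs in a single call so the setup cost is paid once.
scores_fast = reranker.compute_score(pairs)
```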
Thanks for the explanation, but it still makes each inference slower. Doesn't the base model do something similar?
Because the LLM-based reranker introduces a prompt, every call has to go through the DataLoader and pad each batch to keep lengths consistent. The base model works directly on the query and passage, with no prompt, so it avoids this step.
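A rough illustration of the difference (the prompt text below mirrors the one FlagEmbedding documents for its v2 LLM rerankers, but treat it as illustrative):

```python
from transformers import AutoTokenizer

queries = ['what is panda?']
passages = ['The giant panda is a bear species endemic to China.']

# LLM-based reranker: every pair is wrapped in an instruction prompt. The
# prompted, variable-length inputs are why the library collates and pads
# batch by batch through a DataLoader.
prompt = ("Given a query A and a passage B, determine whether the passage "
          "contains an answer to the query by providing a prediction of "
          "either 'Yes' or 'No'.")
llm_inputs = [f"A: {q}\nB: {p}\n{prompt}" for q, p in zip(queries, passages)]

# Base reranker (e.g. bge-reranker-v2-m3): the pair goes straight into the
# cross-encoder tokenizer; one call pads the whole batch, no prompt needed.
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-v2-m3')
base_inputs = tokenizer(queries, passages, padding=True, truncation=True,
                        return_tensors='pt')
```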
Finally got it. So bge-reranker-v2-minicpm-layerwise can't quite keep up when scoring many batches of dialogues. Is there any other way to speed it up? Distillation?
Speedups have to come from the DataLoader side; we will try to optimize that part.
The code has now been updated: you can pass use_dataloader=False to compute_score to solve this problem.
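With the update, that looks roughly like this (use_dataloader=False is the flag named in this thread; the other arguments follow FlagEmbedding's documented example):

```python
from FlagEmbedding import LayerWiseFlagLLMReranker

reranker = LayerWiseFlagLLMReranker('BAAI/bge-reranker-v2-minicpm-layerwise',
                                    use_fp16=True)

scores = reranker.compute_score(
    [['what is panda?', 'The giant panda is a bear species endemic to China.']],
    use_dataloader=False,  # skip the per-call DataLoader path described above
    cutoff_layers=[28],
)
```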
Has the PyPI release not been updated yet?
Hitting the same performance issue; I deploy via Xinference. bge-reranker-v2-m3 does not have this problem.
The code has been updated: pass use_dataloader=False to compute_score to solve this.
Then what about within a single batch: say one QA pair takes 100 ms and ten QA pairs take 1000 ms, i.e. the time scales linearly and batching gives no speedup. How can that be solved?