[Performance] bge-reranker-v2-minicpm-layerwise deployment performance issue
Describe the bug
Deployed bge-reranker-v2-minicpm-layerwise with the latest version of xinference. The download from ModelScope failed; after switching to Hugging Face the deployment succeeded, but inference is extremely slow, to the point of being unusable in practice. The following warnings appear at runtime:
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/root/miniconda3/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2663: UserWarning: `max_length` is ignored when `padding`=`True` and there is no truncation strategy. To pad to max length, use `padding='max_length'`.
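Both warnings point at the tokenization path. A minimal sketch of the call pattern the warnings recommend, assuming (query, passage) string pairs and this model's own tokenizer; the pair content and max_length here are illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-reranker-v2-minicpm-layerwise")

# Batch of (query, passage) pairs to rerank.
pairs = [["what is panda?", "The giant panda is a bear species endemic to China."]]

# A single __call__ tokenizes, pads, and truncates the whole batch in one
# pass, which is what the first warning recommends for fast tokenizers.
# Setting an explicit truncation strategy makes max_length take effect,
# which addresses the second warning.
inputs = tokenizer(
    pairs,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
```

Silencing the warnings will not by itself fix the latency, but it rules out padding and truncation settings as the bottleneck.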
To Reproduce
To help us reproduce this bug, please provide the information below:
- Python 3.10.8
- Xinference v0.10.3
Additional information
I found the following lead in the Hugging Face discussion forum:
https://huggingface.co/BAAI/bge-reranker-v2-minicpm-layerwise/discussions/1
Unless FlagEmbedding ships a new release, adding this parameter is likely to cause errors.
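For context, the model card's own entry point for this model is FlagEmbedding's LayerWiseFlagLLMReranker, where cutoff_layers controls how many layers are evaluated and is the main latency knob. A minimal sketch, assuming FlagEmbedding is installed and a GPU is available; use_fp16 and the layer index are illustrative choices:

```python
from FlagEmbedding import LayerWiseFlagLLMReranker

# fp16 roughly halves memory use and speeds up GPU inference.
reranker = LayerWiseFlagLLMReranker(
    "BAAI/bge-reranker-v2-minicpm-layerwise",
    use_fp16=True,
)

# cutoff_layers selects which intermediate layers produce the score;
# stopping at an earlier layer trades some quality for lower latency.
score = reranker.compute_score(
    ["what is panda?", "The giant panda is a bear species endemic to China."],
    cutoff_layers=[28],
)
print(score)
```

If xinference does not expose cutoff_layers, scoring may run through all layers of the underlying MiniCPM model, which would be consistent with the latency described above.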
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 5 days since being marked as stale.