[Performance] bge-reranker-v2-minicpm-layerwise deployment performance issue
Describe the bug
Deployed bge-reranker-v2-minicpm-layerwise with the latest version of xinference. The download from ModelScope failed; after switching to Hugging Face the deployment succeeded, but inference is extremely slow, to the point of being unusable in practice. The following warnings appear at runtime:
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/root/miniconda3/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2663: UserWarning: `max_length` is ignored when `padding`=`True` and there is no truncation strategy. To pad to max length, use `padding='max_length'`.
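Both warnings point at the tokenization path. A minimal sketch of the call pattern the warnings recommend, assuming (query, passage) string pairs and this model's own tokenizer; the pair content and max_length here are illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-reranker-v2-minicpm-layerwise")

# Batch of (query, passage) pairs to rerank.
pairs = [["what is panda?", "The giant panda is a bear species endemic to China."]]

# A single __call__ tokenizes, pads, and truncates the whole batch in one
# pass, which is what the first warning recommends for fast tokenizers.
# Setting an explicit truncation strategy makes max_length take effect,
# which addresses the second warning.
inputs = tokenizer(
    pairs,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
```

Silencing the warnings will not by itself fix the latency, but it rules out padding and truncation settings as the bottleneck.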
To Reproduce
To help us reproduce this bug, please provide the information below:
- Python 3.10.8
- Xinference v0.10.3
Additional information
I found the following lead in the Hugging Face discussion forum:
https://huggingface.co/BAAI/bge-reranker-v2-minicpm-layerwise/discussions/1
Unless FlagEmbedding ships a new release, adding this parameter is likely to cause errors.
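For context, the model card's own entry point for this model is FlagEmbedding's LayerWiseFlagLLMReranker, where cutoff_layers controls how many layers are evaluated and is the main latency knob. A minimal sketch, assuming FlagEmbedding is installed and a GPU is available; use_fp16 and the layer index are illustrative choices:

```python
from FlagEmbedding import LayerWiseFlagLLMReranker

# fp16 roughly halves memory use and speeds up GPU inference.
reranker = LayerWiseFlagLLMReranker(
    "BAAI/bge-reranker-v2-minicpm-layerwise",
    use_fp16=True,
)

# cutoff_layers selects which intermediate layers produce the score;
# stopping at an earlier layer trades some quality for lower latency.
score = reranker.compute_score(
    ["what is panda?", "The giant panda is a bear species endemic to China."],
    cutoff_layers=[28],
)
print(score)
```

If xinference does not expose cutoff_layers, scoring may run through all layers of the underlying MiniCPM model, which would be consistent with the latency described above.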
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 5 days since being marked as stale.