FlagEmbedding [ONNX model] Discussion: comparing bge-reranker-v2-m3 converted to ONNX against the torch model

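Hi everyone. I converted the bge-reranker-v2-m3 model to ONNX and benchmarked it against the original (GPU: A800). The ONNX model's inference turned out to be far slower than the torch model's. The detailed comparison is in the figure below.

[Figure: torch vs. ONNX timing comparison; image not preserved]

According to the test results, model inference with ONNX is 5.7 times slower than with torch. Does anyone have suggestions or ideas about this? Discussion is welcome.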

Detailed inference timings for the torch model can also be found at: https://github.com/FlagOpen/FlagEmbedding/issues/969

Jul 23 '24 08:07 Tian14267

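Experiments show that the tokenization part of the torch timing is high (29.98 s) mainly because moving tensors from the CPU to the GPU takes a long time. As for why the ONNX model's inference time is long, my initial guess is this: the ONNX model is loaded on the GPU, so even though the input data is NumPy, it still has to be loaded into the graph on the GPU, and that transfer probably costs a lot of time.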
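
One way to test this guess is to time each stage (tokenization, host-to-device transfer, model forward pass) separately rather than as one number. The sketch below is a minimal stdlib-only harness; `StageTimer`, its method names, and the optional `sync` hook are hypothetical and not part of FlagEmbedding or onnxruntime. GPU kernels launch asynchronously, so when timing torch stages a synchronization call such as `torch.cuda.synchronize` should be passed as `sync`; otherwise time may be charged to the wrong stage.

```python
import time
from collections import defaultdict


class StageTimer:
    """Accumulates wall-clock time per named stage (hypothetical helper)."""

    def __init__(self, sync=None):
        # `sync` is an optional zero-argument callable run before each clock
        # read, e.g. torch.cuda.synchronize when timing GPU work.
        self.sync = sync
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def run(self, stage, fn, *args, **kwargs):
        """Call fn(*args, **kwargs), charge its duration to `stage`, return its result."""
        if self.sync is not None:
            self.sync()
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        if self.sync is not None:
            self.sync()
        self.totals[stage] += time.perf_counter() - start
        self.counts[stage] += 1
        return result

    def report(self):
        """Return {stage: (total_seconds, calls, mean_seconds)}."""
        return {
            stage: (total, self.counts[stage], total / self.counts[stage])
            for stage, total in self.totals.items()
        }


if __name__ == "__main__":
    timer = StageTimer()
    # Stand-ins for the real tokenizer and model calls.
    tokens = timer.run("tokenize", str.split, "what is panda?")
    _ = timer.run("inference", len, tokens)
    for stage, (total, calls, mean) in timer.report().items():
        print(f"{stage}: total={total:.6f}s calls={calls} mean={mean:.6f}s")
```

Wrapping the tokenizer call, the device copy, and the model call in separate `timer.run(...)` stages for both backends would show whether the transfer really dominates.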

Jul 23 '24 08:07 Tian14267

Where can I find scripts for converting bge-reranker-v2-m3 to ONNX and for running the converted model with onnxruntime?

Aug 01 '24 06:08 ZhouKai90

Write them yourself~

Aug 05 '24 07:08 Tian14267

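Hmm, I ran 1,000 tests on a batch of fairly long documents: PyTorch averaged about 1.5 s, while ONNX fp16 averaged about 500 ms, so there is still a speedup. Have you tried any other frameworks or methods? I'd like to optimize this further.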

Sep 13 '24 12:09 EvanSong77

You can use optimum-cli.

Oct 12 '24 08:10 akai-shuuichi

I converted bge-reranker-v2-m3 from PyTorch to ONNX format:

optimum-cli export onnx -m BAAI/bge-reranker-v2-m3 --opset 17 --optimize O2 --task feature-extraction --library-name transformers ~/bge-reranker-m3-onnx

Then I ran inference as follows:

from FlagEmbedding import FlagReranker


def local_rerank_flagembedding():
    """https://huggingface.co/BAAI/bge-reranker-v2-m3"""
    reranker = FlagReranker('BAAI/bge-reranker-v2-m3', use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation

    score = reranker.compute_score(['query', 'passage'])
    print(score) # -5.65234375

    # You can map the scores into 0-1 by set "normalize=True", which will apply sigmoid function to the score
    score = reranker.compute_score(['query', 'passage'], normalize=True)
    print(score) # 0.003497010252573502

    scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']])
    print(scores) # [-8.1875, 5.26171875]

    # You can map the scores into 0-1 by set "normalize=True", which will apply sigmoid function to the score
    scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']], normalize=True)
    print(scores) # [0.00027803096387751553, 0.9948403768236574]


def local_rerank_onnx(sentences_pairs, use_gpu=False):
    import onnxruntime as ort
    from transformers import AutoTokenizer
    import pathlib

    model_fp = pathlib.Path.home() / "bge-reranker-m3-onnx/model.onnx"
    tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
    if use_gpu:
        session_options = ort.SessionOptions()
        cuda_provider_options = {
            "device_id": 0,  # use GPU 0
            "gpu_mem_limit": 4 * 1024 * 1024 * 1024,  # limit GPU memory to 4 GB
            "arena_extend_strategy": "kNextPowerOfTwo",  # memory arena growth strategy
        }

        ort_session = ort.InferenceSession(
            model_fp,
            providers=["CUDAExecutionProvider"],
            provider_options=[cuda_provider_options],
            sess_options=session_options,
        )
    else:
        ort_session = ort.InferenceSession(model_fp)
    inputs = tokenizer(sentences_pairs, padding="longest", return_tensors="np")
    inputs_onnx = {k: ort.OrtValue.ortvalue_from_numpy(v) for k, v in inputs.items()}
    outputs = ort_session.run(None, inputs_onnx)
    print(f"Local inference via ONNX(use_gpu={use_gpu}):", outputs)


if __name__ == "__main__":
    local_rerank_flagembedding()
    # local_rerank_onnx(['query', 'passage'], use_gpu=False)
    local_rerank_onnx([['what is panda?', 'hi']], use_gpu=False)
    local_rerank_onnx([['what is panda?', 'hi'], ['what is panda?', 'hi']], use_gpu=False)
    local_rerank_onnx([['what is panda?', 'hi'], ['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']], use_gpu=False)
    local_rerank_onnx([['what is panda?', 'hi'], ['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']], use_gpu=False)
    print("PASS")

My question is: the outputs of ort_session.run have shapes (1, 10, 1024), (2, 10, 1024), (3, 43, 1024), and (4, 43, 1024). How do I get the scores from these outputs?

Jan 10 '25 02:01 yuzhichang

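import time

import numpy as np
import torch
from tqdm import tqdm

# Assumes `tokenizer`, `session` (an onnxruntime.InferenceSession),
# `ro1` (an onnxruntime.RunOptions) and `sigmoid` are defined elsewhere.


def compute_score(sentence_pairs, batch_size: int = 16, max_length: int = 512, normalize: bool = False):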
    with torch.no_grad():
        assert isinstance(sentence_pairs, list)
        if isinstance(sentence_pairs[0], str):
            sentence_pairs = [sentence_pairs]

        all_scores = []
        for start_index in tqdm(range(0, len(sentence_pairs), batch_size), desc="Compute Scores",
                                disable=len(sentence_pairs) < 128):
            sentences_batch = sentence_pairs[start_index:start_index + batch_size]
            s_t = time.time()
            inputs = tokenizer(
                sentences_batch,
                padding=True,
                truncation=True,
                return_tensors='np',
                max_length=max_length,
            )
            print(f'tokenizer time: {(time.time() - s_t) * 1000} ms')
            s_t = time.time()
            onnx_inputs = {
                'input_ids': np.array(inputs['input_ids']),
                'attention_mask': np.array(inputs['attention_mask'])
            }
            outputs = session.run(output_names=["logits"], input_feed=onnx_inputs, run_options=ro1)[0].reshape(-1)
            print(f'inference time: {(time.time() - s_t) * 1000} ms')
            all_scores.extend(outputs)

        if normalize:
            all_scores = [sigmoid(score) for score in all_scores]
        torch.cuda.empty_cache()
        return all_scores

For normalize, refer to how it is done in the FlagEmbedding source code.
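
The snippet above calls a `sigmoid` function that it does not define. According to the comment in the FlagReranker example earlier in this thread, `normalize=True` applies a sigmoid to the raw score, so a minimal pure-Python stand-in could look like the sketch below. The names `sigmoid` and `normalize_scores` are assumptions for illustration, not FlagEmbedding's own helpers.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function for a single raw score."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, exp(x) is small, so this form avoids overflow.
    z = math.exp(x)
    return z / (1.0 + z)


def normalize_scores(scores):
    """Map a list of raw reranker scores into the 0-1 range."""
    return [sigmoid(float(s)) for s in scores]


if __name__ == "__main__":
    # Raw scores quoted earlier in this thread for the two panda pairs.
    print(normalize_scores([-8.1875, 5.26171875]))
```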

Jan 10 '25 05:01 EvanSong77

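I tested again, with max_length raised to 4096. On 10,000 test items with batch=1 on a GPU (NVIDIA GeForce RTX 4090, 24 GB), PyTorch inference took 292.88 s in total, while ONNX fp16 took over 25 minutes. That is a huge gap. I monitored the GPU and CPU during the test and saw nothing wrong; inference was definitely running on the GPU. What is going on here?

I also tried quantization to squeeze out more speed, but nothing helped: torch quantization, ONNX quantization, and llama.cpp quantization all failed to improve efficiency. Also, is there a chat group or something similar? Discussing in a group would probably be more efficient. @EvanSong77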

Jan 14 '25 03:01 Tian14267

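We no longer use ONNX. We now deploy with the vLLM framework (which now supports the bge reranker models). In our tests it works considerably better than ONNX, with lower and more stable GPU memory usage.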

Jan 14 '25 07:01 EvanSong77

@yuzhichang Use --task text-classification when exporting; the model then outputs scores.
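
With a `text-classification` export, the ONNX graph should end in the classification head, so the first output is expected to be logits of shape `(batch, 1)` rather than the `(batch, seq_len, 1024)` hidden states reported above. The sketch below assumes the model was re-exported with the earlier `optimum-cli` command changed only in `--task`, that the file is at `~/bge-reranker-m3-onnx/model.onnx`, and that the graph inputs are named `input_ids` and `attention_mask`. `flatten_logits` and `rerank_onnx` are hypothetical names, and the output shape should be checked against the actual export.

```python
import pathlib


def flatten_logits(logits):
    """Convert (batch, 1) logits (NumPy array or nested lists) to a flat list of floats."""
    return [float(row[0]) for row in logits]


def rerank_onnx(sentence_pairs, model_path=None, max_length=512):
    """Score [query, passage] pairs with an ONNX export made with --task text-classification."""
    # Imported lazily so flatten_logits can be used without these packages.
    import onnxruntime as ort
    from transformers import AutoTokenizer

    if model_path is None:
        # Assumed location, matching the export directory used earlier in the thread.
        model_path = pathlib.Path.home() / "bge-reranker-m3-onnx" / "model.onnx"

    tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-reranker-v2-m3")
    session = ort.InferenceSession(str(model_path), providers=["CPUExecutionProvider"])

    inputs = tokenizer(
        sentence_pairs,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="np",
    )
    # Feed only the inputs the graph declares (assumed: input_ids, attention_mask).
    wanted = {graph_input.name for graph_input in session.get_inputs()}
    feed = {name: array for name, array in inputs.items() if name in wanted}

    logits = session.run(None, feed)[0]  # assumed shape: (batch, 1)
    return flatten_logits(logits)
```

Calling `rerank_onnx([['what is panda?', 'hi']])` should then return one raw score per pair, which can be passed through a sigmoid if 0-1 values are wanted.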

Jan 15 '25 08:01 Haihsu

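Does it support multi-GPU deployment of bge-reranker? Is there any deployment code I could use as a reference? I'd like to see whether I can get it running. @EvanSong77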

Jan 17 '25 09:01 Tian14267

Multi-GPU should be possible, though I haven't tried it. Here is the Docker Swarm script I deploy with:

services:
  server1:
    image: xxx/llm-api:vllm-0.6.5
    command: python -m vllm.entrypoints.openai.api_server --task score --served-model-name bge-reranker-m3 --model /nfsshare/model-checkpoint/bge-reranker-v2-m3/
    environment:
      VLLM_API_KEY : xxxx
      CUDA_VISIBLE_DEVICES : 0
    volumes:
      - /etc/hosts:/etc/hosts
      - nfsshare:/nfsshare:ro
    deploy:
      replicas: 1
      placement:
        constraints: [node.role == manager]
      restart_policy:
        condition: any
    healthcheck:
      test: "curl --fail --request GET 'http://localhost:8000/health' --header 'Authorization: Bearer xxx'|| exit 1"
      interval: 60s
      retries: 1
      start_period: 65s
      timeout: 40s
  portal:
    image: xxx/nginx:1.22-alpine
    ports:
    - 11190:80
    command: |
      /bin/sh -c "echo '
      user nobody nogroup;
      worker_processes auto;
      events {
        worker_connections 1024;
      }
      http {
        server_names_hash_bucket_size 64;
        client_max_body_size 500m;
        keepalive_timeout  120s 120s;
        keepalive_requests 10000;
        proxy_send_timeout 120; 
        proxy_read_timeout 120; 
        proxy_buffer_size 128k;
        proxy_buffers   32 128k;
        proxy_busy_buffers_size 128k;
        upstream services {
          keepalive 1000;
          server server1:8000;
        }
        resolver 127.0.0.11 ipv6=off;
        server {
          listen *:80;
          location / {
            proxy_set_header Connection keep-alive;
            proxy_pass http://services;
            proxy_set_header  Host $$http_host;
            proxy_set_header  X-Real-IP $$remote_addr;
            proxy_set_header  X-Forwarded-For $$proxy_add_x_forwarded_for;
            proxy_buffering off;
            add_header X-Accel-Buffering "no";
            proxy_connect_timeout 600;
            proxy_send_timeout 600;
            proxy_read_timeout 600;
            send_timeout 600;
            proxy_http_version 1.1; 
            proxy_set_header  Connection \" \";
          }
        }
      }' | tee /etc/nginx/nginx.conf && nginx -t && nginx -g 'daemon off;'"
    deploy:
        mode: global
volumes:
  nfsshare:
    external: true
    name: nfsshare
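
Once a stack like the one above is running, the reranker can be queried over HTTP. The sketch below assumes vLLM's score endpoint is exposed at `/score` and accepts `model`, `text_1`, and `text_2` fields, returning a `data` list whose items carry `index` and `score`; the exact path and schema may differ between vLLM versions, so check them against the version in the image. The port `11190` and the model name `bge-reranker-m3` come from the compose file; `build_score_request`, `parse_scores`, and `score` are hypothetical helper names.

```python
import json
import urllib.request


def build_score_request(base_url, model, query, passages, api_key):
    """Build a POST request for the (assumed) vLLM /score endpoint."""
    payload = {"model": model, "text_1": query, "text_2": list(passages)}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/score",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def parse_scores(response_body):
    """Return scores ordered by the `index` field of each item (assumed schema)."""
    items = sorted(response_body["data"], key=lambda item: item["index"])
    return [item["score"] for item in items]


def score(base_url, model, query, passages, api_key, timeout=30):
    """Send one query with several passages and return one score per passage."""
    request = build_score_request(base_url, model, query, passages, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_scores(json.load(response))
```

A call would look like `score("http://localhost:11190", "bge-reranker-m3", "what is panda?", ["hi"], api_key="xxxx")`, with the key matching `VLLM_API_KEY`.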

Jan 20 '25 04:01 EvanSong77

Here is the docker-compose.yaml file from one of my projects, for reference:

services:
  QwQ-32B:
    container_name: Qwen-QwQ-32B-int4
    image: vllm/vllm-openai
    runtime: nvidia
    # ports:
    #   - 8000:8000
    network_mode: "host"
    volumes:
      - "/workspace/Qwen-QwQ-32B-int4:/root/.cache/huggingface/Qwen-QwQ-32B-int4/:ro"
    environment:
      - HF_HUB_OFFLINE=1
    ipc: host
    command:
      - --host=0.0.0.0
      - --model=/root/.cache/huggingface/Qwen-QwQ-32B-int4
      - --tensor_parallel_size=4
      - --max_model_len=32768
      - --quantization=gptq
      - --enable_sleep_mode
      - --dtype=auto
      - --device=cuda
      - --enable_reasoning
      - --reasoning-parser=deepseek_r1
      # - --gpu-memory-utilization=0.5
      - --served-model-name=Qwen-QwQ-32B
      - --compilation-config=3
      - --api-key=ad4fa1617778c024823ab9a60840cfc7fe60fdd635b5f41960c0502f4909c421

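tensor_parallel_size is the tensor-parallel degree and pipeline-parallel-size is the pipeline-parallel degree. These are the two commonly used kinds of parallelism. The total degree of parallelism is the product of all these parameters. See the vLLM launch-argument reference.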
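
As a quick check of the arithmetic: when only these two settings are in play, the number of GPUs one vLLM instance occupies is their product. `gpus_required` is a hypothetical helper for illustration, not a vLLM function, and it treats an unset pipeline-parallel size as 1.

```python
def gpus_required(tensor_parallel_size: int = 1, pipeline_parallel_size: int = 1) -> int:
    """GPUs used by one instance: tensor-parallel degree times pipeline-parallel degree."""
    if tensor_parallel_size < 1 or pipeline_parallel_size < 1:
        raise ValueError("parallel sizes must be positive integers")
    return tensor_parallel_size * pipeline_parallel_size


if __name__ == "__main__":
    # The compose file above sets --tensor_parallel_size=4 and leaves
    # pipeline parallelism unset (treated as 1 here).
    print(gpus_required(tensor_parallel_size=4))  # prints 4
```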

Apr 29 '25 05:04 Kaos-dawn