Shitao Xiao comments

Results: 509 comments of Shitao Xiao

Applying bge-ranker-large to a specific domain: roughly how much data is needed for fine-tuning?

Data on the order of a few thousand examples is generally enough, but the more the better.

langchain integration with bge-m3 or llama-index?

Sorry, we don't know how to use sparse vectors in LangChain either. You can use sparse vectors in Vespa and Milvus.

Problem when saving the model

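This may be because you lack permission to create a folder in the root directory. We suggest saving somewhere else, such as the current directory: ./checkpoints_tmp/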
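A minimal sketch of that suggestion: check up front that the output directory can be created and written to, so a permission problem surfaces before training rather than at save time. The helper name `ensure_writable_dir` is hypothetical; only the `./checkpoints_tmp/` path comes from the reply.

```python
import os
import tempfile

def ensure_writable_dir(path):
    """Create `path` if needed and verify that a file can be written inside it."""
    # Raises PermissionError if the parent directory is not writable.
    os.makedirs(path, exist_ok=True)
    # Writing a throwaway file catches directories that exist but are read-only.
    with tempfile.NamedTemporaryFile(dir=path):
        pass
    return os.path.abspath(path)

output_dir = ensure_writable_dir("./checkpoints_tmp/")
print(output_dir)
```

The returned absolute path can then be passed as the output directory to whatever training script is in use.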

BGE-M3 - Milvus JSON index not supported

Sorry, the latest version of Milvus has not been released yet, so the example cannot be used for now. We have deleted this example.

BGE-M3 - Milvus JSON index not supported

The new version of Milvus has been released. You can now use hybrid retrieval with bge-m3 by following https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasLtMatmul( ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())`

It seems that there is a problem with your CUDA environment.

Evaluated bge-m3's colbert on the MTEB rerank tasks

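Hello, we have not measured this metric ourselves, so we cannot say whether it is correct, but the results look fairly normal. bge-m3-colbert was only used in the third-stage fine-tuning, and the third stage does not contain much Chinese data; unlike the dense vectors, it did not go through RetroMAE pre-training and unsupervised contrastive learning on large amounts of data. A multi-vector approach like colbert keeps the information of every token, so it does tend to have an advantage out of domain.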
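To illustrate why keeping every token matters, here is a minimal sketch of late-interaction (MaxSim-style) scoring over toy 2-dimensional token vectors. It is an illustration only, not the bge-m3 implementation: the vectors are made up, and averaging over query tokens is an assumption (some variants sum instead).

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def maxsim_score(query_vecs, doc_vecs):
    """For each query token take its best-matching document token, then average."""
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    per_token_best = [
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    ]
    return sum(per_token_best) / len(per_token_best)

# Toy token embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # has a close match for each query token
doc_b = [[1.0, 1.0]]                          # one "averaged" vector only

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Because `doc_a` keeps a separate vector per token, each query token can find its own exact match, whereas the single blended vector in `doc_b` matches every query token only partially.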

Similarity scores between similar sentences and the standard question vary widely

Hello, we recommend using the new model, bge-m3. Its dense vectors are used in the same way as with the earlier bge models.

Sparse scores

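1. No model is perfect; there is always room for optimization. 2. Sparse vector models (including BM25) compute scores from word importance and do not normalize them, so the scores are not distributed within [-1, 1] the way dense-vector similarities are. Ranking tasks also care mainly about relative order rather than absolute values. This score is therefore within expectations. Of course, if you have a better way to compute the score, we are happy to discuss it.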
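A small sketch of the point about unnormalized scores, using made-up term weights rather than real bge-m3 or BM25 output. The sparse score is a sum of weight products over shared terms and has no fixed upper bound, while cosine similarity between dense vectors always lies in [-1, 1]. The min-max rescaling is only an assumption added for illustration, to show that rescaling changes the values but not the ranking.

```python
import math

def sparse_score(q_weights, d_weights):
    """Unnormalized lexical score: sum of weight products over shared terms."""
    return sum(w * d_weights[t] for t, w in q_weights.items() if t in d_weights)

def cosine(u, v):
    """Cosine similarity of two dense vectors, always within [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical term weights for one query and three documents.
query = {"bge": 2.0, "sparse": 1.5}
docs = {
    "d1": {"bge": 3.0, "sparse": 2.0, "vector": 1.0},
    "d2": {"bge": 0.5, "dense": 2.0},
    "d3": {"milvus": 1.0},
}

raw = {name: sparse_score(query, weights) for name, weights in docs.items()}

# Min-max rescaling to [0, 1]: values change, order does not.
lo, hi = min(raw.values()), max(raw.values())
scaled = {name: (s - lo) / (hi - lo) for name, s in raw.items()}

rank_raw = sorted(raw, key=raw.get, reverse=True)
rank_scaled = sorted(scaled, key=scaled.get, reverse=True)
dense_sim = cosine([1.0, 2.0], [2.0, 1.0])

print(raw, scaled, rank_raw, dense_sim)
```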

Question about average latency

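Computation time depends on the machine's performance. In addition, the input has to be tokenized before inference, which also takes some time.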
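One way to see how the time splits is to measure tokenization and encoding separately. The sketch below uses hypothetical stand-in functions (`toy_tokenize`, `toy_encode`) in place of a real tokenizer and model, so the numbers it prints mean nothing in themselves; only the measurement pattern carries over.

```python
import time

def toy_tokenize(texts):
    """Hypothetical stand-in for a real tokenizer."""
    return [t.lower().split() for t in texts]

def toy_encode(token_lists):
    """Hypothetical stand-in for the model forward pass."""
    return [[float(len(tok)) for tok in toks] for toks in token_lists]

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

texts = ["What is BGE M3?", "Sparse and dense retrieval"] * 1000

tokens, t_tok = timed(toy_tokenize, texts)
vectors, t_enc = timed(toy_encode, tokens)

print(f"tokenize: {t_tok:.4f}s  encode: {t_enc:.4f}s  total: {t_tok + t_enc:.4f}s")
```

When timing a real model on a GPU, the device should be synchronized before reading the clock, since GPU work is launched asynchronously.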