Shitao Xiao comments

Results: 509 comments of Shitao Xiao

Can sparse vector storage in BGE-M3 be implemented using FAISS?

@hahaha1121871443, FAISS does not support sparse vectors.
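
Since FAISS is not an option here, one common alternative for sparse vectors is an inverted index. The following is only a minimal in-memory sketch, assuming each sparse vector is a token-to-weight mapping; `SparseIndex` and its methods are hypothetical names for illustration, not part of FlagEmbedding or FAISS.

```python
from collections import defaultdict


class SparseIndex:
    """Tiny in-memory inverted index over sparse vectors (hypothetical helper)."""

    def __init__(self):
        # token -> list of (doc_id, weight) postings
        self.postings = defaultdict(list)

    def add(self, doc_id, sparse_vec):
        # sparse_vec is assumed to be a dict mapping token -> weight
        for token, weight in sparse_vec.items():
            self.postings[token].append((doc_id, weight))

    def search(self, query_vec, top_k=3):
        # Dot product restricted to tokens shared by the query and a document.
        scores = defaultdict(float)
        for token, q_weight in query_vec.items():
            for doc_id, d_weight in self.postings.get(token, []):
                scores[doc_id] += q_weight * d_weight
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


index = SparseIndex()
index.add("doc1", {"bge": 0.5, "sparse": 0.25})
index.add("doc2", {"faiss": 0.75, "dense": 0.5})
results = index.search({"sparse": 1.0, "bge": 0.5})
```

Documents that share no token with the query never appear in the result, which is what makes this layout cheap for sparse data. For real workloads a search engine or vector database with native sparse support is likely a better fit than this toy index.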

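bge-multilingual-gemma2: what are the advantages of an LLM-based embedding model over a non-LLM-based one?

An LLM has a large number of parameters and has been trained on large-scale corpora, so its language understanding is very strong. Adding a prompt helps the model distinguish between tasks, such as STS and passage retrieval, which have different requirements.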
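
To illustrate the point about prompts, here is a rough sketch of prefixing the same query with different task instructions. The task descriptions and the `Instruct: ... Query: ...` template are assumptions made up for this example; check the model card for the exact format the model expects.

```python
# Hypothetical task descriptions and template, for illustration only.
TASK_PROMPTS = {
    "sts": "Retrieve semantically similar text.",
    "retrieval": "Given a web search query, retrieve relevant passages that answer the query.",
}


def build_query(task, query):
    """Prefix the query with a task instruction so the model can tell tasks apart."""
    return f"Instruct: {TASK_PROMPTS[task]}\nQuery: {query}"


sts_input = build_query("sts", "how do solar panels work")
retrieval_input = build_query("retrieval", "how do solar panels work")
```

The same query text yields two different model inputs, so the embedding can reflect what the task is asking for.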

How to improve concurrency

You can try using https://github.com/huggingface/text-embeddings-inference

ModuleNotFoundError: No module named 'peft' for BAAI/bge-reranker-v2-m3

> I solved it by running `pip install peft`. It would be better to mention this in the documentation.

It has been added to setup.py: https://github.com/FlagOpen/FlagEmbedding/blob/master/setup.py#L28

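grad_norm is extremely large; is this training normal?

It looks like the training has collapsed. A loss of 28 is too large. Which model did you start training from? Gradient explosion generally calls for a smaller learning rate (learning_rate) and a check of whether the data is normal. The loss scaler rescales the loss to avoid precision overflow.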

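reranker (bge-reranker-large) loss computation problem

@Jeremyywb, 0 refers to the position of the positive sample; the training objective is to maximize the probability at that position. See the official PyTorch documentation on cross-entropy loss: https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss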
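
As a small worked example of what a target of 0 means, the sketch below computes cross-entropy over one group of candidate scores in plain Python. It assumes the positive passage sits at index 0 of each group; `group_cross_entropy` is a hypothetical helper, not the training code itself.

```python
import math


def group_cross_entropy(scores, target=0):
    """Cross-entropy over one group of candidate scores (logits)."""
    max_score = max(scores)  # subtract the max for numerical stability
    log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    # -log softmax(scores)[target]
    return log_sum_exp - scores[target]


good = group_cross_entropy([5.0, 0.0, 0.0])     # positive at index 0 scored highest
bad = group_cross_entropy([0.0, 5.0, 0.0])      # a negative scored highest
uniform = group_cross_entropy([0.0, 0.0, 0.0])  # equals log(3)
```

The loss is small when the score at index 0 dominates and large when a negative dominates, which is why minimizing it maximizes the probability at the target position.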

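bge-m3 loss drops to 0; is this normal?

Check the training data. You can score it with an existing model to see whether the negatives are too easy and the model can tell them apart without effort.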
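
One way to run such a check is to measure how often the positive already beats every negative by a wide margin. This is only a sketch: `easy_fraction`, the `margin` value, and the word-overlap `toy_score` are hypothetical stand-ins, and the `query`/`pos`/`negs` field names are assumed for illustration. In practice `score_fn` would wrap an existing embedding or reranker model.

```python
def easy_fraction(examples, score_fn, margin=0.5):
    """Fraction of examples where the positive beats every negative by at least `margin`."""
    easy = 0
    for ex in examples:
        pos_score = score_fn(ex["query"], ex["pos"])
        best_neg = max(score_fn(ex["query"], neg) for neg in ex["negs"])
        if pos_score - best_neg >= margin:
            easy += 1
    return easy / len(examples)


def toy_score(query, passage):
    # Stand-in scorer (word-overlap ratio); replace with a real model's score.
    q_words, p_words = set(query.split()), set(passage.split())
    return len(q_words & p_words) / len(q_words)


examples = [
    {"query": "capital of france", "pos": "paris is the capital of france",
     "negs": ["bananas are yellow"]},
    {"query": "capital of france", "pos": "paris is the capital of france",
     "negs": ["lyon is a city of france"]},
]
fraction = easy_fraction(examples, toy_score)
```

If most examples count as easy, the loss can fall to 0 quickly without the model learning much.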

bge-m3 loss drops to 0; is this normal?

This way of constructing the data is indeed too simple. You can try a hard-negative mining strategy: https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune#hard-negatives
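
The general idea behind hard-negative mining can be sketched as follows: retrieve candidates with an existing model, then take negatives from a rank window while skipping known positives. `mine_hard_negatives` and its parameters are hypothetical names for this illustration and do not reproduce the linked script, which may sample differently.

```python
def mine_hard_negatives(ranked_ids, positive_ids, start=2, end=10, num_negs=3):
    """Pick negatives from ranks [start, end) of a retrieval result, skipping positives."""
    window = ranked_ids[start:end]
    candidates = [doc_id for doc_id in window if doc_id not in positive_ids]
    # Taking the first few keeps the sketch deterministic; random sampling is also common.
    return candidates[:num_negs]


# Document ids ranked by an existing retriever for one query (made-up data).
ranked = ["d7", "d2", "d9", "d4", "d1", "d5", "d3"]
negs = mine_hard_negatives(ranked, positive_ids={"d7", "d4"})
```

Skipping the very top ranks lowers the chance of picking unlabeled positives as negatives, while staying within the top results keeps the negatives hard.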

ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.

Hi, @sarang-26, it might be due to the runtime environment. You can try running `pip install scikit-learn==1.3.2`; that version did not show this issue in my experiment.

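About using BGE-M3 for natural language inference tasks

You can try using the bge-reranker-v2-m3 model to compute the scores; they are more accurate than vector similarity. If you train your own model, the main issue is the data. You can also use an LLM directly to judge whether the texts are related.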