FlagEmbedding

Retrieval and Retrieval-augmented LLMs

622 FlagEmbedding issues

After fine-tuning the embedding model, I wanted to evaluate it on data I prepared myself, but after passing in the weight path I kept getting the error "Model name 'model_train' not found in the model mapping". After digging for a long time, I found the cause: in FlagEmbedding's design, the framework needs to know both the model type and the weight path. The model type (e.g. bge-m3, bge-base-zh) tells the framework which class (BGEM3FlagModel, etc.) to instantiate; the weight path tells it which saved checkpoint to use when initializing the parameters. When reading --embedder_name_or_path, the framework takes the last directory component as the model name and looks it up in AUTO_EMBEDDER_MAPPING. If the directory is named checkpoint-2110, the framework looks up "checkpoint-2110" as a model name, finds no such entry in the mapping, and raises the error. So we also need to add...
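One workaround (a minimal sketch, assuming the checkpoint is a fine-tuned bge-m3; the path below is the issue's example, not a real location) is to load the weights directly with the model class, which sidesteps the directory-name lookup in AUTO_EMBEDDER_MAPPING entirely:

```python
from FlagEmbedding import BGEM3FlagModel

# Load the fine-tuned weights directly by path; no name lookup is involved.
# "model_train/checkpoint-2110" stands in for the checkpoint directory.
model = BGEM3FlagModel("model_train/checkpoint-2110", use_fp16=True)

# encode() returns a dict; the dense vectors live under "dense_vecs".
embeddings = model.encode(["sample query"])["dense_vecs"]
print(embeddings.shape)
```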

Thank you very much for BGE-M3! I am implementing something similar, and I found a line in your code that puzzles me a bit: https://github.com/FlagOpen/FlagEmbedding/blob/2225aacb54cf9e807aa116dfffeb0cceb291b38b/FlagEmbedding/finetune/embedder/encoder_only/m3/modeling.py#L227 Might it be that the colbert...

When I use BGE-Code-V1 (based on Qwen2.5-Coder-1.5B) as the retriever in my RAG pipeline, I find that query–chunk similarity scores are always around ~0.541, regardless of the query and document content. Task:...
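Not the model itself, but a self-contained illustration of one plausible cause: if every embedding shares a large common component (anisotropy), pairwise cosine similarity collapses toward a constant regardless of content. All values below are made up for the demonstration:

```python
import numpy as np

# Illustration only: vectors that share a dominant common direction
# produce near-constant pairwise cosine similarity, whatever the "content".
rng = np.random.default_rng(0)
dim = 1536
common = rng.normal(size=dim)                      # shared component
vecs = [common + 0.9 * rng.normal(size=dim) for _ in range(4)]
vecs = [v / np.linalg.norm(v) for v in vecs]       # L2-normalize

for v in vecs:
    print([round(float(v @ w), 3) for w in vecs])  # off-diagonals all ~0.55
```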

Great code! Can you share whether you fine-tuned Mistral or only used smart prompting?

The encoder trainers all appear to be train-only, which seems really odd to me. Please explain the design choice not to run evaluation during training; it seems very standard.

```python
def _pool(self, embeddings, attention_mask):
    if "mean" in self.pooling_method:
        # Zero out padded positions, then average over the real tokens.
        embeddings = embeddings.masked_fill(
            ~attention_mask[..., None].bool(), 0.0)
        embedding = embeddings.sum(
            dim=1) / attention_mask.sum(dim=1, keepdim=True)
    elif "cls" in self.pooling_method:
        # Take the hidden state of the first ([CLS]) token.
        embedding = embeddings[:, 0]
    elif ...
```
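A toy check of the mean branch (a sketch, assuming PyTorch tensors; the values are invented) shows how padding positions are zeroed and excluded from the average:

```python
import torch

embeddings = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]])  # (batch=1, seq=3, dim=2)
attention_mask = torch.tensor([[1, 1, 0]])  # last position is padding

masked = embeddings.masked_fill(~attention_mask[..., None].bool(), 0.0)
mean_pooled = masked.sum(dim=1) / attention_mask.sum(dim=1, keepdim=True)
print(mean_pooled)  # tensor([[2., 3.]]) — the average of the two real tokens
```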

Why does the empty string have a similarity above 0.5 with every other string? ![image](https://github.com/FlagOpen/FlagEmbedding/assets/161291221/86732cab-39e2-42d9-8201-a61f2ce623c4) The similarities: ![image](https://github.com/FlagOpen/FlagEmbedding/assets/161291221/12b916ff-4c7e-4e50-abd2-dd9fb4482767) The printed vector of the empty string: ![image](https://github.com/FlagOpen/FlagEmbedding/assets/161291221/4957c801-95d6-46c0-b147-1ce0422d9958)
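A minimal reproduction sketch (assuming the model in question is bge-m3 and its dense vectors are used): an empty input is still tokenized into the special tokens, so its embedding is not a zero vector and has a nonzero similarity baseline against any other string:

```python
import numpy as np
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3")
vecs = model.encode(["", "hello world"])["dense_vecs"]

a, b = vecs[0], vecs[1]
cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(cos)  # well above zero: the empty string's vector comes from special tokens
```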

I found that the vector results for "Hangzhou City" obtained by these two methods are different. What is the reason? This is the code: ` model = HuggingFaceEmbedding( model_name='/home/nepf/hwd/bge-m3/', device="cpu", )...
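One common cause is a pooling or normalization mismatch between the two wrappers. A sketch (treat the pooling choice as an assumption to verify against both pipelines) of computing bge-m3's dense vector by hand as the L2-normalized CLS hidden state:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Compute the dense vector manually, then compare it against each wrapper's
# output to see which one deviates (e.g. by mean-pooling or skipping the norm).
tok = AutoTokenizer.from_pretrained("BAAI/bge-m3")
mdl = AutoModel.from_pretrained("BAAI/bge-m3")
mdl.eval()

inputs = tok(["Hangzhou City"], return_tensors="pt")
with torch.no_grad():
    hidden = mdl(**inputs).last_hidden_state        # (batch, seq, dim)
cls_vec = torch.nn.functional.normalize(hidden[:, 0], dim=-1)
print(cls_vec[0, :5])
```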

Hello! Thank you very much for open-sourcing these two models, the strongest multilingual models in the industry, supporting 190+ languages. I now want to narrow the language coverage down to seven: Chinese, English, Japanese, Korean, Spanish, French, and Arabic. Could you advise on how to do this? Some of my ideas:

1. Fine-tune bge-m3 and bge-rerank-v2-m3 on data in these seven languages, so the parameters shift toward them.
2. Prune the vocabulary to keep only these seven languages, then repeat step 1.

Are these feasible?
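For idea 2, a hedged first step (a sketch; the corpus below is a placeholder, not real data) is to count which token ids the seven-language corpus actually uses, so the embedding matrix can later be shrunk to those rows plus the special tokens before fine-tuning:

```python
from collections import Counter
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("BAAI/bge-m3")
corpus = ["杭州市", "Hello world", "Bonjour", "안녕하세요"]  # placeholder corpus

# Count every token id the corpus touches; unused rows of the embedding
# matrix are candidates for removal when trimming the vocabulary.
used = Counter()
for text in corpus:
    used.update(tok(text)["input_ids"])
print(len(used), "distinct token ids used out of", tok.vocab_size)
```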