Shitao Xiao
@ngothanhnam0910 , the normalized scores are not appropriate for fine-tuning because the distribution becomes too smooth after softmax. You should use the scores before normalization.
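For example, if the scores come from a bge reranker, one way to keep the un-normalized scores (a sketch assuming the `FlagReranker` interface from this repo) is:

```python
from FlagEmbedding import FlagReranker

# Illustrative sketch: normalize=False keeps the raw relevance logits
# instead of squashing them into [0, 1].
reranker = FlagReranker('BAAI/bge-reranker-v2-m3', use_fp16=True)

pairs = [
    ['what is panda?', 'The giant panda is a bear species endemic to China.'],
    ['what is panda?', 'Paris is the capital of France.'],
]

raw_scores = reranker.compute_score(pairs, normalize=False)
print(raw_scores)  # use these raw scores as the training signal, not the normalized ones
```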
@jhyeom1545 , I guess the reason is that the negative samples are too challenging, so the model has to reduce the scores. You can use a larger sample range...
We pre-train and fine-tune bge-m3 on long texts. You can refer to our paper: https://arxiv.org/abs/2402.03216
@chengzi-big , the transformer architecture itself does not have a length limit. The length limit comes from the positional encoding. We use absolute positional encoding with a length of...
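As a quick check (an illustrative snippet, not part of the original reply), the positional-encoding length is visible in the model config:

```python
from transformers import AutoConfig

# The usable input length is bounded by the absolute position embeddings,
# whose size is stored in the model config.
config = AutoConfig.from_pretrained('BAAI/bge-m3')
print(config.max_position_embeddings)
```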
`max_length` is the maximum number of tokens. We will truncate the text and only keep the first 8192 tokens. The upper bound of `max_length` in bge-reranker-v2-* is 8192.
A larger `max_length` allows the model to process long texts, but it comes with a higher computational cost. The small default value (512) is meant to speed up inference. If most...
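For instance (a sketch assuming `compute_score` accepts a `max_length` argument, as in the current FlagReranker interface):

```python
from FlagEmbedding import FlagReranker

reranker = FlagReranker('BAAI/bge-reranker-v2-m3', use_fp16=True)

pairs = [['summarize the contract terms', 'a very long contract text ...']]

# Only raise max_length when the inputs are actually long; the default 512
# keeps inference fast, and 8192 is the upper bound for bge-reranker-v2-*.
scores = reranker.compute_score(pairs, max_length=8192)
print(scores)
```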
1. Yes, bge is a general-purpose base model. 2. The full sparse vector always has the same dimension. In the example at https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3#generate-embedding-for-text, what you get is a dict containing the tokens and their corresponding weights, i.e., the non-zero entries of the sparse vector (all other entries are 0).
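The linked example boils down to something like this (following the BGE_M3 README):

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentences = ["What is BGE M3?", "BM25 is a bag-of-words retrieval function."]
output = model.encode(sentences, return_dense=True, return_sparse=True)

# Each item in lexical_weights is a dict {token: weight}: the non-zero entries
# of the full sparse vector; all other dimensions are implicitly 0.
print(output['lexical_weights'][0])
```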
1. No, we did not remove stop words during training either. Removing them manually might improve performance, but we have not tried that. 2. The training code for unsupervised/semi-supervised contrastive learning is the same as the supervised one; see https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune. The only difference is that the data is constructed in a fixed way (e.g., title-body pairs) rather than from human annotations. A sketch of building such data is shown below.
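A minimal sketch of constructing such data (the file name and documents are made up; the query/pos/neg jsonl format follows the finetune example):

```python
import json

documents = [
    {"title": "BGE M3-Embedding", "body": "A multilingual, multi-granularity embedding model ..."},
    {"title": "Dense retrieval", "body": "Dense retrieval maps queries and passages into vectors ..."},
]

with open("train_data.jsonl", "w", encoding="utf-8") as f:
    for i, doc in enumerate(documents):
        # Other documents' bodies serve as negatives; no human annotation needed.
        negs = [d["body"] for j, d in enumerate(documents) if j != i]
        record = {"query": doc["title"], "pos": [doc["body"]], "neg": negs}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```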
This script uses the Hugging Face Trainer for fine-tuning, so you can use the hyperparameters listed on this page: https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments
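For example, any of these TrainingArguments can be passed to the fine-tuning script as command-line flags (the values below are illustrative, not recommendations):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./bge_finetune_output",   # hypothetical path
    learning_rate=1e-5,
    per_device_train_batch_size=4,
    num_train_epochs=2,
    warmup_ratio=0.1,
    fp16=True,
    logging_steps=10,
    save_steps=1000,
)
```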
After inference, the embeddings are placed on the CPU.
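So the returned arrays are already in CPU memory; if you need them on the GPU, move them explicitly (a sketch assuming bge-m3 and a PyTorch downstream step):

```python
import torch
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

# encode() returns numpy arrays, i.e., the embeddings are already on the CPU.
emb = model.encode(["an example sentence"])['dense_vecs']

# Move them to the GPU only if downstream computation needs it.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
emb_gpu = torch.from_numpy(emb).to(device)
```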