Shitao Xiao
@ngothanhnam0910 , the normalized scores are not appropriate for fine-tuning because the distribution becomes too smooth after softmax. You should use the scores before normalization.
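For example, if the scores come from a bge reranker, one way to keep the un-normalized scores (a sketch assuming the `FlagReranker` interface from this repo) is:

```python
from FlagEmbedding import FlagReranker

# Illustrative sketch: normalize=False keeps the raw relevance logits
# instead of squashing them into [0, 1].
reranker = FlagReranker('BAAI/bge-reranker-v2-m3', use_fp16=True)

pairs = [
    ['what is panda?', 'The giant panda is a bear species endemic to China.'],
    ['what is panda?', 'Paris is the capital of France.'],
]

raw_scores = reranker.compute_score(pairs, normalize=False)
print(raw_scores)  # use these raw scores as the training signal, not the normalized ones
```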
@jhyeom1545 , I guess the reason is that the negative samples are too challenging, so the model has to reduce the scores. You can use a larger sample range...
We pre-train and fine-tune bge-m3 on long texts. You can refer to our paper: https://arxiv.org/abs/2402.03216
@chengzi-big , the transformer architecture itself does not have a length limit. The length limit comes from the positional encoding. We use absolute positional encoding with a length of...
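As a quick check (an illustrative snippet, not part of the original reply), the positional-encoding length is visible in the model config:

```python
from transformers import AutoConfig

# The usable input length is bounded by the absolute position embeddings,
# whose size is stored in the model config.
config = AutoConfig.from_pretrained('BAAI/bge-m3')
print(config.max_position_embeddings)
```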
`max_length` is the maximum number of tokens. We will truncate the text and only keep the first 8192 tokens. The upper bound of `max_length` in bge-reranker-v2-* is 8192.
A larger `max_length` allows the model to process long texts, but it comes with a higher computational cost. The small default value (512) is meant to speed up inference. If most...
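For instance (a sketch assuming `compute_score` accepts a `max_length` argument, as in the current FlagReranker interface):

```python
from FlagEmbedding import FlagReranker

reranker = FlagReranker('BAAI/bge-reranker-v2-m3', use_fp16=True)

pairs = [['summarize the contract terms', 'a very long contract text ...']]

# Only raise max_length when the inputs are actually long; the default 512
# keeps inference fast, and 8192 is the upper bound for bge-reranker-v2-*.
scores = reranker.compute_score(pairs, max_length=8192)
print(scores)
```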
1. Yes, bge is a general-purpose base model. 2. The full sparse vector always has the same dimension. In the example at https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3#generate-embedding-for-text, what you get is a dict containing the tokens and their corresponding weights, i.e., the non-zero entries of the sparse vector (all other entries are 0).
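The linked example boils down to something like this (following the BGE_M3 README):

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentences = ["What is BGE M3?", "BM25 is a bag-of-words retrieval function."]
output = model.encode(sentences, return_dense=True, return_sparse=True)

# Each item in lexical_weights is a dict {token: weight}: the non-zero entries
# of the full sparse vector; all other dimensions are implicitly 0.
print(output['lexical_weights'][0])
```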
1. No, we did not remove stop words during training either. Removing them manually might improve performance, but we have not tried that. 2. The training code for unsupervised/semi-supervised contrastive learning is the same as the supervised one; see https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune. The only difference is that the data is constructed in a fixed way (e.g., title-body pairs) rather than from human annotations. A sketch of building such data is shown below.
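A minimal sketch of constructing such data (the file name and documents are made up; the query/pos/neg jsonl format follows the finetune example):

```python
import json

documents = [
    {"title": "BGE M3-Embedding", "body": "A multilingual, multi-granularity embedding model ..."},
    {"title": "Dense retrieval", "body": "Dense retrieval maps queries and passages into vectors ..."},
]

with open("train_data.jsonl", "w", encoding="utf-8") as f:
    for i, doc in enumerate(documents):
        # Other documents' bodies serve as negatives; no human annotation needed.
        negs = [d["body"] for j, d in enumerate(documents) if j != i]
        record = {"query": doc["title"], "pos": [doc["body"]], "neg": negs}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```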
This script uses the Hugging Face Trainer for fine-tuning, so you can use the hyperparameters listed on this page: https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments
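For example, any of these TrainingArguments can be passed to the fine-tuning script as command-line flags (the values below are illustrative, not recommendations):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./bge_finetune_output",   # hypothetical path
    learning_rate=1e-5,
    per_device_train_batch_size=4,
    num_train_epochs=2,
    warmup_ratio=0.1,
    fp16=True,
    logging_steps=10,
    save_steps=1000,
)
```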
After inference, the embeddings are placed on the CPU.
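So the returned arrays are already in CPU memory; if you need them on the GPU, move them explicitly (a sketch assuming bge-m3 and a PyTorch downstream step):

```python
import torch
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

# encode() returns numpy arrays, i.e., the embeddings are already on the CPU.
emb = model.encode(["an example sentence"])['dense_vecs']

# Move them to the GPU only if downstream computation needs it.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
emb_gpu = torch.from_numpy(emb).to(device)
```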