Shitao Xiao

509 comments by Shitao Xiao

I suggest asking the Milvus team; I'm not familiar with this issue.

The loss seems large. I suspect there are false negatives in the training data (some samples in the negative list `neg: List[str]` are actually positives).
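
For a quick sanity check, here is a minimal sketch that flags negatives which also appear verbatim among the positives. The file name is illustrative, and the jsonl schema `{"query": ..., "pos": [...], "neg": [...]}` is the standard FlagEmbedding fine-tuning format; real false negatives (relevant passages labeled as negatives) still need manual or model-based inspection.

```python
import json

# Sketch: report rows whose `neg` list contains strings that also appear in `pos`.
with open("train_data.jsonl", encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        example = json.loads(line)
        overlap = set(example["neg"]) & set(example["pos"])
        if overlap:
            print(f"line {line_no}: {len(overlap)} negatives also listed as positives")
```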

Please share your command and environment so that I can analyze the possible cause.

Save it as a dict. See the Sparse Embedding (Lexical Weight) part of https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3#generate-embedding-for-text for reference.
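
Roughly what the linked README section shows (the model name and sentences are illustrative): calling `encode` with `return_sparse=True` returns `lexical_weights`, one dict of token weights per sentence, which can be stored as-is.

```python
from FlagEmbedding import BGEM3FlagModel

# Illustrative sketch based on the linked BGE-M3 README section.
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

sentences = ["What is BGE M3?", "BM25 is a bag-of-words retrieval function."]
output = model.encode(sentences, return_dense=True, return_sparse=True, return_colbert_vecs=False)

# `lexical_weights` is a list of dicts (token id -> weight), one per sentence.
print(output["lexical_weights"][0])
# Optionally map token ids back to readable tokens:
print(model.convert_id_to_token(output["lexical_weights"])[0])
```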

@songge25 You can reduce GPU memory usage with DeepSpeed stage 1/2/3. Just edit `"stage": 0` at https://github.com/FlagOpen/FlagEmbedding/blob/master/examples/finetune/ds_config.json#L36, changing 0 to 1, 2, or 3 (memory savings: 3 > 2 > 1).
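
For example, a sketch of that edit done programmatically, assuming the standard DeepSpeed `zero_optimization` layout used in the config file:

```python
import json

# Sketch: raise the ZeRO stage in examples/finetune/ds_config.json.
# Higher stages shard more optimizer/model state and save more GPU memory (3 > 2 > 1).
path = "examples/finetune/ds_config.json"
with open(path) as f:
    ds_config = json.load(f)

ds_config["zero_optimization"]["stage"] = 2  # was 0; use 1, 2, or 3
with open(path, "w") as f:
    json.dump(ds_config, f, indent=2)
```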

> Also, I noticed that `per_device_train_batch_size` is hard-coded to 1 in run.py. Why is that?

@zuoyifan132 You are referring to the bge-m3 embedding fine-tuning code, right? We don't use a single unified batch size; instead, the batch size is defined per dataset in data.py: https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/BGE_M3/data.py#L22. That is why the outer setting is 1.
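
Conceptually, the idea looks like the sketch below (this is illustrative, not the actual data.py code): each dataset yields an already-built batch as a single item, so the trainer's `per_device_train_batch_size=1` simply means "one pre-built batch per step".

```python
from torch.utils.data import Dataset

class PreBatchedDataset(Dataset):
    """Illustrative sketch: each __getitem__ returns a whole batch drawn from one
    dataset, using that dataset's own batch size (not the actual data.py code)."""

    def __init__(self, examples, batch_size):
        self.examples = examples
        self.batch_size = batch_size

    def __len__(self):
        return len(self.examples) // self.batch_size

    def __getitem__(self, idx):
        start = idx * self.batch_size
        return self.examples[start:start + self.batch_size]

# With a dataset like this, the DataLoader / Trainer batch size stays at 1,
# because every item is already a full batch.
```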

This is a data-format problem; it looks like some entries in `pos` are not of string type.
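
A quick way to locate the offending rows (the file name is illustrative; the jsonl fields follow the standard FlagEmbedding fine-tuning format):

```python
import json

# Sketch: report rows where `pos` or `neg` contains non-string entries.
with open("train_data.jsonl", encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        example = json.loads(line)
        for field in ("pos", "neg"):
            bad = [x for x in example.get(field, []) if not isinstance(x, str)]
            if bad:
                print(f"line {line_no}: non-string values in '{field}': {bad}")
```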

See https://discuss.huggingface.co/t/how-to-train-the-embedding-of-special-token/10837. After calling `tokenizer.add_tokens` and `model.resize_token_embeddings`, save both the tokenizer and the model with `save_pretrained`, then fine-tune directly.
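
A minimal sketch of those steps using the Hugging Face `transformers` API (the model name, new tokens, and output directory are illustrative):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = AutoModel.from_pretrained("BAAI/bge-m3")

# Add new special tokens, then resize the embedding matrix to match the tokenizer.
tokenizer.add_tokens(["[NEW_TOK_1]", "[NEW_TOK_2]"], special_tokens=True)
model.resize_token_embeddings(len(tokenizer))

# Save both so the fine-tuning script can load them directly.
tokenizer.save_pretrained("bge-m3-extended")
model.save_pretrained("bge-m3-extended")
```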

All tasks are trained with contrastive learning.
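
For reference, a minimal sketch of the kind of in-batch-negative contrastive (InfoNCE-style) objective commonly used for embedding training; the temperature and function name are illustrative, not the exact training code:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb, p_emb, temperature=0.05):
    """InfoNCE with in-batch negatives: the i-th passage is the positive for the
    i-th query; all other passages in the batch act as negatives."""
    q_emb = F.normalize(q_emb, dim=-1)
    p_emb = F.normalize(p_emb, dim=-1)
    scores = q_emb @ p_emb.T / temperature          # [batch, batch] similarity matrix
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)
```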