Shitao Xiao

509 comments by Shitao Xiao

I suggest asking the Milvus team; I'm not familiar with this issue.

The loss seems large. I suspect there are false negatives in the training data (some samples in the negative list `neg: List[str]` are actually positives).
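
For a quick sanity check, here is a minimal sketch that flags negatives which also appear verbatim among the positives. The file name is illustrative, and the jsonl schema `{"query": ..., "pos": [...], "neg": [...]}` is the standard FlagEmbedding fine-tuning format; real false negatives (relevant passages labeled as negatives) still need manual or model-based inspection.

```python
import json

# Sketch: report rows whose `neg` list contains strings that also appear in `pos`.
with open("train_data.jsonl", encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        example = json.loads(line)
        overlap = set(example["neg"]) & set(example["pos"])
        if overlap:
            print(f"line {line_no}: {len(overlap)} negatives also listed as positives")
```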

Please share your command and environment so that I can analyze the possible cause.

Save it as a dict. See the Sparse Embedding (Lexical Weight) part of https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3#generate-embedding-for-text for reference.
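
Roughly what the linked README section shows (the model name and sentences are illustrative): calling `encode` with `return_sparse=True` returns `lexical_weights`, one dict of token weights per sentence, which can be stored as-is.

```python
from FlagEmbedding import BGEM3FlagModel

# Illustrative sketch based on the linked BGE-M3 README section.
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

sentences = ["What is BGE M3?", "BM25 is a bag-of-words retrieval function."]
output = model.encode(sentences, return_dense=True, return_sparse=True, return_colbert_vecs=False)

# `lexical_weights` is a list of dicts (token id -> weight), one per sentence.
print(output["lexical_weights"][0])
# Optionally map token ids back to readable tokens:
print(model.convert_id_to_token(output["lexical_weights"])[0])
```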

@songge25 You can reduce GPU memory usage with DeepSpeed stage 1/2/3. Just edit `"stage": 0` at https://github.com/FlagOpen/FlagEmbedding/blob/master/examples/finetune/ds_config.json#L36, changing 0 to 1, 2, or 3 (memory savings: 3 > 2 > 1).
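
For example, a sketch of that edit done programmatically, assuming the standard DeepSpeed `zero_optimization` layout used in the config file:

```python
import json

# Sketch: raise the ZeRO stage in examples/finetune/ds_config.json.
# Higher stages shard more optimizer/model state and save more GPU memory (3 > 2 > 1).
path = "examples/finetune/ds_config.json"
with open(path) as f:
    ds_config = json.load(f)

ds_config["zero_optimization"]["stage"] = 2  # was 0; use 1, 2, or 3
with open(path, "w") as f:
    json.dump(ds_config, f, indent=2)
```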

> Also, I noticed that `per_device_train_batch_size` is hard-coded to 1 in run.py. Why is that?

@zuoyifan132 You are referring to the bge-m3 embedding fine-tuning code, right? We don't use a single unified batch size; instead, the batch size is defined per dataset in data.py: https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/BGE_M3/data.py#L22. That is why the outer setting is 1.
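
Conceptually, the idea looks like the sketch below (this is illustrative, not the actual data.py code): each dataset yields an already-built batch as a single item, so the trainer's `per_device_train_batch_size=1` simply means "one pre-built batch per step".

```python
from torch.utils.data import Dataset

class PreBatchedDataset(Dataset):
    """Illustrative sketch: each __getitem__ returns a whole batch drawn from one
    dataset, using that dataset's own batch size (not the actual data.py code)."""

    def __init__(self, examples, batch_size):
        self.examples = examples
        self.batch_size = batch_size

    def __len__(self):
        return len(self.examples) // self.batch_size

    def __getitem__(self, idx):
        start = idx * self.batch_size
        return self.examples[start:start + self.batch_size]

# With a dataset like this, the DataLoader / Trainer batch size stays at 1,
# because every item is already a full batch.
```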

This is a data-format problem; it looks like some entries in `pos` are not of string type.
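
A quick way to locate the offending rows (the file name is illustrative; the jsonl fields follow the standard FlagEmbedding fine-tuning format):

```python
import json

# Sketch: report rows where `pos` or `neg` contains non-string entries.
with open("train_data.jsonl", encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        example = json.loads(line)
        for field in ("pos", "neg"):
            bad = [x for x in example.get(field, []) if not isinstance(x, str)]
            if bad:
                print(f"line {line_no}: non-string values in '{field}': {bad}")
```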

See https://discuss.huggingface.co/t/how-to-train-the-embedding-of-special-token/10837. After calling `tokenizer.add_tokens` and `model.resize_token_embeddings`, save both the tokenizer and the model with `save_pretrained`, then fine-tune directly.
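
A minimal sketch of those steps using the Hugging Face `transformers` API (the model name, new tokens, and output directory are illustrative):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = AutoModel.from_pretrained("BAAI/bge-m3")

# Add new special tokens, then resize the embedding matrix to match the tokenizer.
tokenizer.add_tokens(["[NEW_TOK_1]", "[NEW_TOK_2]"], special_tokens=True)
model.resize_token_embeddings(len(tokenizer))

# Save both so the fine-tuning script can load them directly.
tokenizer.save_pretrained("bge-m3-extended")
model.save_pretrained("bge-m3-extended")
```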

All tasks are trained with contrastive learning.
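
For reference, a minimal sketch of the kind of in-batch-negative contrastive (InfoNCE-style) objective commonly used for embedding training; the temperature and function name are illustrative, not the exact training code:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb, p_emb, temperature=0.05):
    """InfoNCE with in-batch negatives: the i-th passage is the positive for the
    i-th query; all other passages in the batch act as negatives."""
    q_emb = F.normalize(q_emb, dim=-1)
    p_emb = F.normalize(p_emb, dim=-1)
    scores = q_emb @ p_emb.T / temperature          # [batch, batch] similarity matrix
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)
```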