
Retrieval and Retrieval-augmented LLMs

Results 622 FlagEmbedding issues

# Run script: the official example https://github.com/FlagOpen/FlagEmbedding/blob/master/examples/finetune/embedder/encoder_only/m3_same_dataset.sh, with per_device_train_batch_size changed to 4 (the original value of 2 does not raise an error).
# Error traceback:
```
File "/mnt/bn/rc-tob-lq/users/huangrong.max/FlagEmbedding/FlagEmbedding/finetune/embedder/encoder_only/m3/modeling.py", line 426, in forward
[rank1]:     ensemble_scores, ensemble_loss = compute_loss_func(
[rank1]:   File "/mnt/bn/rc-tob-lq/users/huangrong.max/FlagEmbedding/FlagEmbedding/abc/finetune/embedder/AbsModeling.py", line 149, in _compute_no_in_batch_neg_loss
[rank1]:     local_scores = self.compute_local_score(q_reps, p_reps, compute_score_func,...
```

```
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

from FlagEmbedding import BGEM3FlagModel
import pandas as pd
import numpy as np

if __name__ == "__main__":
    model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)
    df = pd.read_parquet("00000.parquet")
    npy_data...
```

Background: I am fine-tuning bge-m3 on my own classification data. A final fine-tuning sample looks like:
{"query": "这商品真差,质量一点也不好", "pos": ["评价非常差,评分应为1"], "neg": ["评价非常高,评分应为5", "评价还不错,评分应为4", "评价一般,评分应为3", "评价一般,评分应为2"]}
Questions:
1. My understanding is that if negatives_cross_device is set to true during fine-tuning, negatives from other samples are sampled to expand the negative set; but since the raw data has only 5 classes, there is a high probability of sampling a passage identical to the current sample's pos. Do I need to adjust the training parameters for this?
2. The official bge-m3 fine-tuning script https://github.com/FlagOpen/FlagEmbedding/blob/master/examples/finetune/embedder/encoder_only/m3_same_dataset.sh also includes a classification dataset, /example_data/classification-no_in_batch_neg. It looks like negatives_cross_device would cause a general problem here: when computing the NCE loss, the negatives are very likely to contain the positive (a sketch of this false-negative issue follows below)?
3. With the settings from the script in 2, I ran an ablation (the data has four labels A/B/C/D; I held out all label-D data as an out-of-domain test) comparing the original bge-m3, bge-m3-w-label trained on my own supervised data, and training on my own data mixed with bge-m3-data. After adding my own data, performance on the A/B/C classes is considerably better than bge-m3, but on D, performance keeps dropping as steps increase and falls below the original bge-m3. Is there a good way to reduce this kind of knowledge forgetting, so that performance on D is preserved as much as possible?
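
A minimal sketch of the false-negative concern in questions 1–2, assuming a plain InfoNCE loss over passages gathered across devices; the function name `info_nce_with_false_neg_mask` and the `same_label_mask` construction are hypothetical and this is not FlagEmbedding's actual loss code.

```python
# Hypothetical illustration, not FlagEmbedding's implementation:
# mask cross-device passages that share the query's label ("false negatives")
# before computing the InfoNCE / cross-entropy loss.
import torch
import torch.nn.functional as F

def info_nce_with_false_neg_mask(q_reps, p_reps, pos_idx, same_label_mask, temperature=0.02):
    """q_reps: (B, d) query embeddings; p_reps: (N, d) passages gathered across devices;
    pos_idx: (B,) index of each query's own positive in p_reps;
    same_label_mask: (B, N) True where a passage carries the same label as the query's positive."""
    scores = q_reps @ p_reps.T / temperature                # (B, N) similarity logits
    mask = same_label_mask.clone()
    mask[torch.arange(q_reps.size(0)), pos_idx] = False     # never mask the positive itself
    scores = scores.masked_fill(mask, float("-inf"))        # drop false negatives from the denominator
    return F.cross_entropy(scores, pos_idx)
```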

Is liger kernel supported? [Liger kernel](https://github.com/linkedin/Liger-Kernel) can increase training throughput (+20%) and significantly reduce memory usage (-60%).
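
As a rough illustration of what using Liger kernels with a decoder-only embedder could look like, here is a minimal sketch; it assumes the liger_kernel patch helper `apply_liger_kernel_to_mistral` and a Mistral-based backbone such as bge-en-icl, and is not a claim that FlagEmbedding's trainer supports this out of the box.

```python
# Hypothetical sketch: patch the Hugging Face Mistral modeling classes with
# Liger kernels *before* the model is instantiated, then load the embedder
# backbone as usual. Assumes a Mistral-based model (e.g. BAAI/bge-en-icl).
from liger_kernel.transformers import apply_liger_kernel_to_mistral
from transformers import AutoModel

apply_liger_kernel_to_mistral()          # swap in fused RoPE / RMSNorm / SwiGLU kernels
model = AutoModel.from_pretrained("BAAI/bge-en-icl")
```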

Thank you for sharing your outstanding work. Using scatter_reduce instead of scatter allows you to create a tensor of shape (bs, vocab_size) instead of (bs, length, vocab_size), which reduces memory...
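
A minimal sketch of this suggestion, assuming the per-token sparse weights are max-pooled per vocabulary id; the shapes and variable names are illustrative and not taken from the library's code.

```python
# Illustrative sketch (not FlagEmbedding's actual code): aggregate per-token
# sparse weights directly into a (bs, vocab_size) tensor with scatter_reduce_,
# avoiding the (bs, length, vocab_size) intermediate a one-hot scatter would need.
import torch

bs, length, vocab_size = 4, 16, 30522                      # toy shapes (assumptions)
token_ids = torch.randint(0, vocab_size, (bs, length))     # input_ids
token_weights = torch.rand(bs, length)                     # e.g. relu(linear(hidden_states))

sparse = torch.zeros(bs, vocab_size)
# keep the maximum weight per vocab id; include_self=False ignores the initial zeros
sparse.scatter_reduce_(1, token_ids, token_weights, reduce="amax", include_self=False)
print(sparse.shape)                                        # torch.Size([4, 30522])
```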

ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone...

I have some questions regarding the origin of the [training queries used for BGE-EN-ICL](https://huggingface.co/datasets/cfli/bge-full-data) for datasets that have no training queries in BEIR: * **quora**: 10k test, 5k dev queries in BEIR...

```
Traceback (most recent call last):
  File "/xxx/anaconda/envs/LLM/lib/python3.10/site-packages/FlagEmbedding/abc/inference/AbsReranker.py", line 229, in __del__
  File "/xxx/anaconda/envs/LLM/lib/python3.10/site-packages/FlagEmbedding/abc/inference/AbsReranker.py", line 88, in stop_self_pool
  File "/xxx/anaconda/envs/LLM/lib/python3.10/site-packages/FlagEmbedding/abc/inference/AbsReranker.py", line 350, in stop_multi_process_pool
  File "/xxx/anaconda/envs/LLM/lib/python3.10/multiprocessing/process.py", line 133, in terminate
  File...
```
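
The frames point at AbsReranker.__del__ tearing down its multi-process pool, which typically runs at interpreter shutdown. A hedged workaround sketch (not an official fix) is to release the reranker explicitly before the script exits:

```python
# Hypothetical workaround sketch: delete the reranker while the interpreter
# is still fully alive, so its multi-process pool is stopped before shutdown.
from FlagEmbedding import FlagReranker

reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)
scores = reranker.compute_score([["what is a panda?", "The giant panda is a bear species endemic to China."]])
print(scores)

del reranker  # triggers pool cleanup now, not during interpreter shutdown
```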

In CPU mode, with the bge-m3 or bge-large-zh model loaded: if the same BGEM3FlagModel instance has its encode method called from multiple threads to compute embeddings, can the resulting vectors be incorrect?
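
A conservative workaround sketch, under the assumption that encode on a shared instance is not guaranteed to be thread-safe (the sketch does not settle whether it is): serialize calls with a lock, or give each thread its own model instance.

```python
# Conservative sketch (assumption: thread safety of a shared instance is unknown):
# guard encode() with a lock so only one thread runs it at a time.
import threading
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=False)   # CPU-mode example
lock = threading.Lock()

def safe_encode(texts):
    with lock:
        return model.encode(texts)["dense_vecs"]
```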

Using chinese-roberta-large + MTP unlabel zh, training with the following settings:
```
--num_gpus 8 --per_device_train_batch_size 2400 \
--do_lower_case true \
--learning_rate 1e-5 \
--weight_decay 0.001 \
--warmup_ratio 0.05 \
--temperature 0.02 \
--num_train_epochs 3 \
--train_group_size...
```