FlagEmbedding issues

Hard-negative mining command fails with an error

3

```bash
pretrained_model=/XXX/bge-large-zh-v1.5
raw_data=/XXX/toy_finetune_data.jsonl
after_data=/XXX/toy_finetune_data_minedHN.jsonl
python3 -m FlagEmbedding.baai_general_embedding.finetune.hn_mine \
--model_name_or_path ${pretrained_model} \
--input_file ${raw_data} \
--output_file ${after_data} \
--range_for_sampling 2-7 \
--negative_number 5
```

I removed the GPU option and adjusted the count parameters to fit toy_data; toy_data itself is unchanged. The error output is below:

----------using 8*GPUs----------
inferencing embedding for corpus...

Daya-Jin

How batch_size is calculated

2

Hello, does the batch_size=19200 mentioned in the paper refer to the product Device_num **X** per_device_batch_size **X** accumulation_steps?
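The formula proposed in the question can be written out as a small calculation. This is only a sketch of that hypothesis: whether the paper's 19200 was actually obtained this way is not confirmed here, and the example values (96 devices, 200 samples per device, 1 accumulation step) are invented purely so that the product comes out to 19200.

```python
def effective_batch_size(device_num: int,
                         per_device_batch_size: int,
                         accumulation_steps: int = 1) -> int:
    """Number of samples contributing to one optimizer step under the
    formula proposed in the question (an assumption, not a confirmed fact)."""
    return device_num * per_device_batch_size * accumulation_steps


# Hypothetical configuration, chosen only to illustrate the arithmetic.
example = effective_batch_size(device_num=96,
                               per_device_batch_size=200,
                               accumulation_steps=1)
print(example)
```

Note that for contrastive training with in-batch negatives, gradient accumulation does not necessarily enlarge the pool of negatives seen per step, so the three factors may not be interchangeable in practice.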

zhaobinNF

Multi-GPU inference: DistributedDataParallel question

3

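Hello, I am trying to use multiple GPUs to embed a very large amount of text, currently by calling model.encode() directly. I have observed that the actual speedup with multiple GPUs is much smaller than the theoretical one. With a single GPU, utilization stays at 100%, but with multiple GPUs it is often below 100% or even at 0%. After checking the documentation, I found that FlagModel appears to use DataParallel rather than DistributedDataParallel for multi-GPU inference. Is this setting the cause? If I want to improve multi-GPU inference efficiency, what should I adjust? Thank you!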
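One workaround sometimes used for this kind of throughput problem is to avoid DataParallel and launch one process per GPU, each encoding its own slice of the corpus. The sketch below shows only the sharding step. `shard_texts` is a hypothetical helper, not part of FlagEmbedding, and whether this resolves the utilization drop described above has not been verified.

```python
from typing import List, Sequence


def shard_texts(texts: Sequence[str], world_size: int, rank: int) -> List[str]:
    """Return the contiguous slice of `texts` assigned to process `rank`.

    Shard sizes differ by at most one item, and concatenating the shards
    in rank order reproduces the original sequence.
    """
    if world_size <= 0 or not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    base, extra = divmod(len(texts), world_size)
    # The first `extra` ranks each take one additional item.
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return list(texts[start:end])


# Toy corpus split across four hypothetical worker processes.
corpus = [f"doc-{i}" for i in range(10)]
shards = [shard_texts(corpus, world_size=4, rank=r) for r in range(4)]
```

Each worker would then be pinned to a single GPU (for example through `CUDA_VISIBLE_DEVICES`), call model.encode() on its own shard, and write its embeddings to a separate file to be concatenated afterwards.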

tomleung1996

Threshold for unlabeled data

4

There is a paragraph like this on page 4 of your technical report: > The text pairs selected from the web and other public sources are not guaranteed to be...

iambestfeeddddd

activation_beacon has a maximum context window of 400K; are there comparative evaluation results against existing long-context models (baichuan-192k, GPT-4-128k, kimi chat)?

1

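The paper mainly compares against fine-tuning methods (such as Positional Interpolation, NTK-Aware Scale ROPE, and StreamingLLM). Are there accuracy comparisons against existing commercial long-context models? I would like to know how well this approach performs.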

cnsky2016

Sample data for fine-tuning llm-embedder

3

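Hello, could you share one or two sample records from the NQ dataset for training and validation? I am unsure about a few things in the schema. One more question: do teacher_scores, pos, and answers have the same length? Or is teacher_scores[i] = p(answer[i]|pos[i])?

```
# training
{
    "query": str,
    "pos": List[str],
    "neg": List[str],
    "pos_index": Optional[List[int]],  # Indices of the positives w.r.t. the corpus. When a global...
```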

lierer007

bge-large-zh: the loss intermittently shows 0.0 during fine-tuning; is this normal?

4

![微信图片_20240123104453](https://github.com/FlagOpen/FlagEmbedding/assets/44698199/1ae0f7e9-5bd8-40a1-a0e8-4e42de07668a) The trained model is usable; I am simply curious.

128Ghe980

LM_Cocktail "Merge based on samples" ERROR: Subtraction, the `-` operator, with a bool tensor is not supported

Problem: merging chatglm3 with samples raises the following error: Traceback (most recent call last): File "~/llm_cocktail/mix_mdl.py", line 67, in model2 = mix_models_with_data( File "~/miniconda3/envs/train_py310/lib/python3.10/site-packages/LM_Cocktail/cocktail.py", line 102, in mix_models_with_data weights =...

charliedream1

How can I set range_for_sampling when fine-tuning a reranker model? Thanks

2

In https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune#hard-negatives I saw that: range_for_sampling: where to sample negative. For example, 2-100 means sampling negative from top2-top200 documents. You can set larger value to reduce the difficulty of negatives...
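To make the quoted description concrete, here is a minimal sketch of how a range string such as `2-100` could be turned into a sampling window over a ranked candidate list. It is an illustration under assumptions, not the code of `hn_mine`: `parse_range` and `sample_negatives` are hypothetical names, and the two bounds are treated as a zero-based Python slice `[start:end)`.

```python
import random
from typing import List, Sequence, Set, Tuple


def parse_range(spec: str) -> Tuple[int, int]:
    """Parse a string such as "2-100" into integer bounds (start, end)."""
    start_text, end_text = spec.split("-")
    start, end = int(start_text), int(end_text)
    if start < 0 or end <= start:
        raise ValueError(f"invalid sampling range: {spec!r}")
    return start, end


def sample_negatives(ranked_docs: Sequence[str],
                     positives: Set[str],
                     spec: str,
                     negative_number: int,
                     seed: int = 0) -> List[str]:
    """Sample up to `negative_number` non-positive documents from the
    slice of the ranking selected by `spec`."""
    start, end = parse_range(spec)
    # Keep only candidates inside the window that are not known positives.
    window = [doc for doc in ranked_docs[start:end] if doc not in positives]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.sample(window, min(negative_number, len(window)))


# Toy ranking: "d0" is the top retrieved document, "d9" the last.
ranked = [f"d{i}" for i in range(10)]
negatives = sample_negatives(ranked, positives={"d3"}, spec="2-7", negative_number=5)
```

Under this reading, a larger upper bound widens the window toward lower-ranked documents, which tends to make the sampled negatives easier, in line with the quoted advice.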

Yazooliu

Training prints "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained."; inference fails because the tokenizer cannot be found

3

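During training I already see the message "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained" ![image](https://github.com/FlagOpen/FlagEmbedding/assets/89055561/8718e027-c1fa-403a-ba71-d41b1f86ea68) At inference time, calling From_pretrained_tokenizer raises os.error because a file cannot be found. After investigating, I found that the tokenizer_config.json saved with the fine-tuned model differs from the base model's tokenizer_config.json. ![image](https://github.com/FlagOpen/FlagEmbedding/assets/89055561/697fdf0a-8a13-4a1d-ab70-ed49e7aeb9d5) This seems to indicate that it depends on the original base model's tokenizer_config.json. Because I train and deploy on different machines, the file cannot be found and the error is raised. The model only works once I place the base model's files at the corresponding location. This looks like a bug: when I trained with an earlier version of the bge scripts, the problem did not occur and the tokenizer_config file was not changed. The library versions I am using are: sentence-transformers 2.2.2 transformers 4.34.0...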

blue-vision0
