chaofan
Reranker scores span a wide range, and different rerankers produce scores in different ranges; relevance is judged mainly by the relative order of the scores. If you must set a threshold, run the reranker on a few relevant / irrelevant examples and analyze the resulting scores.
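As a minimal sketch of that calibration step (the scores below are illustrative placeholders; in practice you would collect them by scoring your own labeled pairs with the reranker):

```python
# Hypothetical raw reranker scores collected from a handful of
# labeled query-passage pairs (values are made up for illustration).
relevant_scores = [4.2, 3.1, 5.6, 2.8]
irrelevant_scores = [-6.3, -4.9, -7.1, -3.5]

# One simple choice of threshold: the midpoint between the lowest
# score among relevant pairs and the highest among irrelevant pairs.
threshold = (min(relevant_scores) + max(irrelevant_scores)) / 2
print(threshold)
```

If the two score groups overlap, no clean threshold exists and relying on relative ranking (rather than an absolute cutoff) is the safer option.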
bge-reranker-large does not support a 2k input length; you can fine-tune bge-reranker-v2-m3 instead.
Mainly check whether the model behaves as expected. After fine-tuning the embedding model, you can use it to mine hard negatives and then fine-tune the reranker on them.
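The mining step above can be sketched as follows (the embeddings here are random stand-ins; real ones would come from your fine-tuned embedding model's encode call):

```python
import numpy as np

# Toy stand-ins for embeddings produced by a fine-tuned embedding model.
rng = np.random.default_rng(0)
query_emb = rng.normal(size=8)          # one query embedding
corpus_embs = rng.normal(size=(100, 8)) # 100 passage embeddings
positive_ids = {3, 17}                  # passages labeled relevant to this query

# Rank the corpus by similarity to the query, then take the
# top-ranked passages that are NOT labeled positive as hard negatives.
scores = corpus_embs @ query_emb
ranked = np.argsort(-scores)
hard_negatives = [int(i) for i in ranked if int(i) not in positive_ids][:5]
print(hard_negatives)
```

These high-scoring non-positives are "hard" because the embedding model already confuses them with true positives, which makes them informative training examples for the reranker.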
It is not compatible. You can use it by following [LLARA-usage](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/LLARA#usage).
Sorry, our current training is not compatible with DeepSpeed ZeRO3. We recommend using a lower stage.
You can refer to the [flash attention](https://github.com/Dao-AILab/flash-attention) installation guide for this. Fine-tuning is generally not possible on a CPU-only machine; if you only need inference, installing the [finetune] extra is unnecessary.
Normally, using a unified format has little effect on the relative ranking. If the results are poor, consider using [bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3).
The ColBERT vectors and sparse embeddings are fine-tuned together. If you want to remove the `colbert vector`, you need to remove the corresponding code in the [finetune module](https://github.com/FlagOpen/FlagEmbedding/blob/f9f673e4ff159324d39c20a0c29686ca1e849963/FlagEmbedding/finetune/embedder/encoder_only/m3/modeling.py#L256).
Since sparse embeddings are not normalized, the sparse similarity between two identical embeddings will generally not equal 1. No normalization is needed.
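A minimal sketch of why this happens (the token weights below are made up; real sparse embeddings map token ids to learned lexical weights):

```python
# A sparse (lexical-weight) embedding maps tokens to weights.
# Similarity is the dot product over shared tokens; because the
# weights are not normalized, an embedding scored against itself
# yields the sum of squared weights, not 1.
def sparse_similarity(a, b):
    return sum(w * b[t] for t, w in a.items() if t in b)

emb = {"deep": 0.8, "learning": 0.6, "model": 0.3}
# 0.8**2 + 0.6**2 + 0.3**2 = 1.09, not 1.0
print(sparse_similarity(emb, emb))
```

This is expected behavior: the absolute score only matters relative to the scores of other candidates for the same query.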
The `pos` and `neg` samples in the dataset should be stored in list format, so that the mined hard negatives are complete sentences. Otherwise, if strings are passed in, ...
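For illustration, one training record with `pos` and `neg` in list format might look like this (the query and passage texts are made-up examples):

```python
import json

# One training record: `pos` and `neg` hold lists of complete
# sentences, not single concatenated strings.
record = {
    "query": "what is a reranker?",
    "pos": ["A reranker scores query-passage pairs for relevance."],
    "neg": [
        "Tokyo is the capital of Japan.",
        "Photosynthesis converts light into chemical energy.",
    ],
}
# Records like this are typically written one per line (JSONL).
print(json.dumps(record))
```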