namespace-Pt comments

50 comments


namespace-Pt

Did you remove punctuation before computing the document score?

BTW, I think keeping punctuation in both the query and the document would result in overly long posting lists.
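To illustrate why punctuation inflates posting lists, here is a minimal sketch of a toy inverted index. The corpus, the regex tokenizer, and the helper names `tokenize` and `build_index` are illustrative assumptions, not part of any particular retrieval library.

```python
import re
import string
from collections import defaultdict


def tokenize(text: str, keep_punct: bool) -> list[str]:
    # Split into word tokens and single punctuation characters.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    if not keep_punct:
        # Drop tokens that are punctuation characters.
        tokens = [t for t in tokens if t not in string.punctuation]
    return tokens


def build_index(docs: list[str], keep_punct: bool) -> dict[str, list[int]]:
    # Map each term to the sorted list of document ids that contain it.
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for tok in tokenize(text, keep_punct):
            index[tok].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = [
    "BM25 is a ranking function.",
    "Posting lists map terms to documents.",
    "Short queries, long documents.",
]
with_punct = build_index(docs, keep_punct=True)
without_punct = build_index(docs, keep_punct=False)
print(with_punct["."])        # [0, 1, 2]
print("." in without_punct)   # False
```

On a real corpus almost every document contains `.` or `,`, so those posting lists approach the size of the whole collection while contributing little to ranking.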

Did you remove punctuation before computing the document score?

OK, thank you. I also wonder how you obtained your 7 negative samples: are they just randomly sampled from the negatives collected from the triples file?

Problem running the toy example of run_dense.py during distillation

Hi, most of them were obtained with run_lm_score.py; see [here](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/llm_embedder/docs/fine-tune.md#lm-scoring) for the method. You can use an LLM as the teacher, as we did, or use a reranker as the teacher instead.
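As a rough illustration of using an LLM as the teacher, a common formulation scores each candidate by how likely the LLM finds the expected answer once the candidate is placed before the query. The sketch below assumes that formulation; `token_logprob` and `toy_logprob` are hypothetical stand-ins for a real causal LM, and this is not the code in `run_lm_score.py`.

```python
import math
from typing import Callable, Sequence

# Hypothetical interface: log p(token | prefix) under some causal LM.
TokenLogProb = Callable[[Sequence[str], str], float]


def lm_score(candidate: Sequence[str], query: Sequence[str],
             answer: Sequence[str], token_logprob: TokenLogProb) -> float:
    """Average log-likelihood of the answer tokens given candidate + query."""
    prefix = list(candidate) + list(query)
    total = 0.0
    for tok in answer:
        total += token_logprob(prefix, tok)
        prefix.append(tok)  # teacher forcing: condition on the gold answer so far
    return total / len(answer)


def toy_logprob(prefix: Sequence[str], tok: str) -> float:
    # Toy "LM" (assumption): tokens already seen in the prefix are more likely.
    return math.log(0.9) if tok in prefix else math.log(0.1)


query = ["who", "wrote", "hamlet"]
answer = ["shakespeare"]
candidates = [
    ["hamlet", "was", "written", "by", "shakespeare"],
    ["paris", "is", "in", "france"],
]
scores = [lm_score(c, query, answer, toy_logprob) for c in candidates]
```

With a reranker as the teacher, `lm_score` would simply be replaced by the reranker's relevance score for each (query, candidate) pair.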

Questions about the teacher scores produced by the LLM

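Hi, msmarco has about 400k queries and each query needs scores for 200 candidates, which makes 80M scoring operations; this took us about 15 hours on 8×A100 (40G). You can reduce the number of candidates as appropriate.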
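The cost estimate above can be checked with a few lines of arithmetic. This sketch assumes scoring time grows linearly with the number of candidates, which is only an approximation (batching and padding change real throughput); the helper `estimate_hours` is illustrative.

```python
NUM_QUERIES = 400_000       # "about 400k" msmarco queries, per the comment above
CANDIDATES_PER_QUERY = 200
OBSERVED_HOURS = 15         # wall-clock time on 8x A100 (40G)
NUM_GPUS = 8

total_scores = NUM_QUERIES * CANDIDATES_PER_QUERY           # 80,000,000
scores_per_second = total_scores / (OBSERVED_HOURS * 3600)   # across all GPUs
per_gpu_per_second = scores_per_second / NUM_GPUS


def estimate_hours(candidates_per_query: int, num_queries: int = NUM_QUERIES) -> float:
    """Estimated wall-clock hours, assuming throughput stays constant."""
    return num_queries * candidates_per_query / scores_per_second / 3600


print(f"{total_scores:,} scores, ~{scores_per_second:.0f}/s total, "
      f"~{per_gpu_per_second:.0f}/s per GPU")
print(f"50 candidates per query: ~{estimate_hours(50):.2f} h")
```

Under this linear assumption, cutting the candidates from 200 to 50 brings the estimate down from 15 h to about 3.75 h.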

Questions about the teacher scores produced by the LLM

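The code supports generating LLM scores for convsearch, but we did not use them in the end: answers in convsearch are relatively long, so the LLM scores may be less accurate, and training with the scores performed even worse than training with the labels alone. The other tasks (qa, chat, lrlm, icl) also need LLM scores; msmarco is just an example to help you estimate the scoring time.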

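Hi,
1. Because we use multi-task training, the convsearch task may be affected by the other tasks, which is why its result in the w.o. LLM Reward row is lower than in the w. LLM Reward row; however, neither row uses LLM Reward on convsearch.
2. By default we only use it on qa and icl. chat and lrlm have no shared corpus, so in-batch negatives are meaningless there, which reduces the effectiveness of `stabilize_distill` (it turns the distillation loss into a weighted average of a series of contrastive losses, and contrastive learning relies heavily on in-batch negatives). On chat and lrlm we do, however, set `teacher_temperature=0.1`, which partly alleviates the problem of `teacher_scores` being too evenly distributed.
3. Following [Replug](https://arxiv.org/pdf/2301.12652.pdf), we set `teacher_temperature` to 0.1; setting it to 1 performs worse.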
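To make points 2 and 3 concrete: if the teacher distribution is taken as `softmax(teacher_scores / teacher_temperature)`, the usual cross-entropy distillation loss can be rewritten as a teacher-weighted average of InfoNCE losses, one per candidate treated as the positive. The sketch below is a simplified pure-Python illustration of that identity, not the `stabilize_distill` implementation in FlagEmbedding, which may, for example, choose the negatives in each denominator differently. The `in_batch_negatives` argument is an assumption showing where extra negatives would enter.

```python
import math
from typing import Sequence


def softmax(xs: Sequence[float], temperature: float = 1.0) -> list[float]:
    scaled = [x / temperature for x in xs]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def logsumexp(xs: Sequence[float]) -> float:
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def distill_loss(student: Sequence[float], teacher: Sequence[float],
                 teacher_temperature: float) -> float:
    """Cross-entropy between the teacher distribution and the student softmax."""
    p = softmax(teacher, teacher_temperature)
    log_z = logsumexp(student)
    return -sum(pi * (si - log_z) for pi, si in zip(p, student))


def weighted_contrastive_loss(student: Sequence[float], teacher: Sequence[float],
                              teacher_temperature: float,
                              in_batch_negatives: Sequence[float] = ()) -> float:
    """Sum over i of w_i * InfoNCE(positive=i, denominator=all candidates [+ in-batch])."""
    w = softmax(teacher, teacher_temperature)
    # Denominator shared by every term: all candidates plus any in-batch negatives.
    log_z = logsumexp(list(student) + list(in_batch_negatives))
    return sum(wi * (log_z - si) for wi, si in zip(w, student))


student = [2.0, 1.0, 0.5]      # retriever similarities for three candidates
teacher = [-1.2, -1.5, -1.6]   # e.g. LM scores for the same candidates
```

With `teacher_temperature=1` these teacher weights are close to uniform (about 0.41, 0.31, 0.28), while 0.1 puts about 0.94 of the weight on the top candidate. When there are no meaningful in-batch negatives, the only negatives left in each term are the other candidates of the same query.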

Questions about the teacher scores produced by the LLM

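We didn't keep the pre-scoring files either :disappointed_relieved: You can generate one yourself as follows:
1. Decide `chunk_size` (number of tokens per chunk) and `candidate_num` (number of candidate chunks to retrieve from).
2. Encode the long text into input_ids, split it into chunks of `chunk_size`, and take `candidate_num+2` consecutive chunks.
3. The last chunk is `answer_inputs`, the second-to-last is `query_inputs`, and the rest are `score_inputs`.
4. Simply set `context_inputs` to an empty list.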
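The steps above can be sketched as follows. This assumes the text has already been tokenized into `input_ids`; the function name `build_lm_score_example`, the `start_chunk` parameter, dropping an incomplete trailing chunk, and returning a plain dict are all assumptions, and the exact file format expected by the scoring script is not shown here.

```python
def build_lm_score_example(input_ids: list[int], chunk_size: int,
                           candidate_num: int, start_chunk: int = 0) -> dict:
    """Split tokenized long text into the fields used for LM scoring.

    Assumptions: trailing tokens that do not fill a whole chunk are dropped,
    and the hypothetical `start_chunk` picks where the consecutive window begins.
    """
    # Step 2: split into full chunks of `chunk_size` tokens.
    num_full = len(input_ids) // chunk_size
    chunks = [input_ids[i * chunk_size:(i + 1) * chunk_size] for i in range(num_full)]
    window_len = candidate_num + 2
    if start_chunk < 0 or start_chunk + window_len > len(chunks):
        raise ValueError("text too short for (candidate_num + 2) full chunks")
    window = chunks[start_chunk:start_chunk + window_len]
    # Step 3: last chunk = answer, second-to-last = query, the rest = candidates.
    return {
        "answer_inputs": window[-1],
        "query_inputs": window[-2],
        "score_inputs": window[:-2],
        # Step 4: no extra context.
        "context_inputs": [],
    }


example = build_lm_score_example(list(range(20)), chunk_size=4, candidate_num=2)
```

Sliding `start_chunk` over a long document is one way to get several training examples from the same text.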


activation_beacon's maximum context window is 400K; are there comparative evaluation results against existing long-context models (baichuan-192k, GPT-4-128k, kimi chat)?

activation beacon Needle In A Haystack test failed