Evaluation of re-ranking
Hello, I'm trying to understand how the re-ranking experiment is evaluated, specifically how the new ranks for the low-ranked documents are computed. Below is the relevant portion of the code:
```python
collision, new_score, collision_cands = gen_aggressive_collision(
    query, best_sent, model, tokenizer, device, best_score, lm_model)

if args.verbose:
    log('---Rank shifts for less relevant documents---')
    # scale the collision score by the sum of the Birch sentence weights
    weighted_new_score = sum(BIRCH_ALPHAS) * new_score
    for did in bm25_q_doc[qid]:
        # interpolate the BM25 score of doc `did` with the collision-derived score
        new_score = bm25_q_doc[qid][did] * BIRCH_GAMMA + weighted_new_score * (1 - BIRCH_GAMMA)
        old_rank, old_score = target_q_doc[qid][did]
        new_rank = 1000 - bisect.bisect_left(old_scores, new_score)
        log(f'Query id={qid}, Doc id={did}, '
            f'old score={old_score:.2f}, new score={new_score:.2f}, '
            f'old rank={old_rank}, new rank={new_rank}')
```
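For reference, this is how I read the rank computation (a toy illustration; I'm assuming `old_scores` is the ascending-sorted list of the 1000 original document scores):

```python
import bisect

# Toy stand-in for old_scores: ascending scores of the originally retrieved
# documents (the real list has 1000 entries, hence the `1000 -` in the script).
old_scores = [0.10, 0.25, 0.40, 0.80]
new_score = 0.50

# bisect_left counts how many existing scores fall strictly below new_score;
# subtracting from the list length gives the number of documents scoring at
# least new_score, i.e. the new 0-based rank (0 = best).
new_rank = len(old_scores) - bisect.bisect_left(old_scores, new_score)
print(new_rank)  # 1: only the 0.80 document still outranks the new score
```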
From the code, the collision-generation method produces a `new_score` that is tied to the query and the generated collision.
What confuses me is why this single `new_score` is reused directly when computing the new ranks of the low-ranked documents. According to the paper, for each of the 50 query topics, irrelevant articles ranked between 900 and 1000 by Birch are selected, and collisions are inserted to boost their ranks. However, the script doesn't appear to insert the collision into those documents; it simply reuses the same `new_score` for every low-ranked document. Shouldn't the test recompute each low-ranked document's embedding (and hence its BERT score) after inserting the collision, along the lines of the sketch below?
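For concreteness, this is roughly what I expected the evaluation to do. This is a hypothetical sketch, not code from the repo: `load_document`, `insert_collision`, and `birch_score` are placeholder names, and the other variables are reused from the snippet above.

```python
# Hypothetical per-document re-scoring; placeholder functions, not repo APIs.
for did in bm25_q_doc[qid]:
    doc_text = load_document(did)                        # placeholder loader
    attacked_doc = insert_collision(doc_text, collision)
    # Re-run the ranking model on the *modified* document so the collision
    # actually influences its embedding / sentence scores.
    bert_score = birch_score(query, attacked_doc, model, tokenizer, device)
    new_score = bm25_q_doc[qid][did] * BIRCH_GAMMA + bert_score * (1 - BIRCH_GAMMA)
    new_rank = 1000 - bisect.bisect_left(old_scores, new_score)
```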
Am I misunderstanding something here?