ColBERT
Duplicate search results when `k` is a high value
Hello! I tried to search using a large `k` value. I noticed that ColBERT returns some unique results, but when there aren't enough results, it soon starts returning duplicate results (with the same passage ID) until it reaches the `k` value specified.
For example, the code below:
query = "some question"  # or supply your own query
print(f"Question: {query}")
results = searcher.search(query, k=1000)
print(f"NUM RESULTS {len(results[0])}")
unique_pids = set()
repeated_pids = []
for passage_id, passage_rank, passage_score in zip(*results):
    text = searcher.collection[passage_id]
    if passage_id in unique_pids:
        repeated_pids.append(passage_id)
        continue
    unique_pids.add(passage_id)
print(len(repeated_pids), repeated_pids)
print(len(unique_pids), unique_pids)
prints the following:
Question: some question
NUM RESULTS 1000
872 [25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, ...] # seems to duplicate just one passage over and over again
128 {1, 1027, 1034, 2572, 1041, 1043, ...}
Shouldn't ColBERT only be returning unique results? Is this a known bug?
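As a stopgap until the root cause is found, the duplicates can be filtered out on the caller's side. This is only a minimal sketch, assuming `results` has the shape the snippet above unpacks with `zip(*results)` (three parallel sequences of passage IDs, ranks, and scores). `dedupe_results` is a hypothetical helper name, not part of ColBERT, and it hides the symptom rather than fixing the bug:

```python
def dedupe_results(results):
    """Drop repeated passage IDs, keeping the first (best-ranked) occurrence.

    `results` is assumed to be (passage_ids, ranks, scores), i.e. the same
    three parallel sequences that zip(*results) iterates over.
    """
    seen = set()
    pids, scores = [], []
    for pid, _rank, score in zip(*results):
        if pid in seen:
            continue  # a later copy of a passage we already kept
        seen.add(pid)
        pids.append(pid)
        scores.append(score)
    # Renumber ranks so they stay contiguous after duplicates are removed.
    ranks = list(range(1, len(pids) + 1))
    return pids, ranks, scores
```

Calling `dedupe_results(searcher.search(query, k=1000))` would then yield at most one entry per passage ID, so the result may be shorter than `k`.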
This is very unusual; it should not return repeated passages unless some recent change is causing that.
Do you have a guess, @santhnm2 @detaos @jessiejuachon?
Agree with @okhat, this is very strange - I'll look into it.
I'm running into another issue with large `k` values.
Running search with any large `k`, e.g. 100, returns exactly 64 passages.
@s-jse I've experienced that issue as well. The number of passages returned seems to get stuck at a specific number (a power of 2) no matter what value of `k` is passed. We noticed it when varying `k` in a for-loop: the count freezes at that number, and no further results are returned even if they exist. Code snippet used to test this:
from time import sleep

def get_answers(query):
    # Run the same query with increasing k and compare returned vs. unique counts.
    for k in [1, 10, 100]:
        print(f"Question: {query} k: {k}")
        results = searcher.search(query, k=k)
        print(f"NUM RESULTS: {len(results[0])}")
        unique_pids = set()
        for passage_id, passage_rank, passage_score in zip(*results):
            print(f"passage_id={passage_id} passage_rank={passage_rank} passage_score={passage_score}")
            unique_pids.add(passage_id)
        print(f"UNIQUE PIDS: {len(unique_pids)}")
        sleep(2)

if __name__ == '__main__':
    # load_model() and load_dataset() are my own helpers (not shown here).
    searcher, output_indexes, output_content = load_model()
    narratives = load_dataset()
    for i in range(100):
        narrative_dict = narratives[i]
        get_answers(narrative_dict['question'])
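The two symptoms reported in this thread (duplicates padding the list up to `k`, and the count capping below `k`) can be told apart with a small summary of each result. This is a rough sketch under the same assumption as the snippets above, namely that `results[0]` holds the passage IDs. `diagnose` is a hypothetical helper, not a ColBERT function:

```python
from collections import Counter

def diagnose(results, k):
    """Summarize one search result against the requested k."""
    pids = list(results[0])  # assumed: first element holds the passage IDs
    counts = Counter(pids)
    returned = len(pids)
    return {
        "requested": k,
        "returned": returned,
        "unique": len(counts),
        # Passage IDs that appear more than once (the duplicate symptom).
        "duplicated_pids": sorted(p for p, c in counts.items() if c > 1),
        # Fewer results than requested; this can also be legitimate when
        # the collection itself has fewer than k passages.
        "capped": returned < k,
        # The capped counts reported in this thread were powers of two.
        "returned_is_power_of_two": returned > 0 and returned & (returned - 1) == 0,
    }
```

Printing `diagnose(searcher.search(query, k=k), k)` inside the loop above would show at a glance which of the two behaviours a given query triggers.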
I've used ColBERT quite a lot over the past few days and fixed this one for myself, so I wanted to share a solution. I submitted a PR. It's my first time submitting to an open source project, and I welcome any feedback.
Cheers
@paul7Junior really appreciate this! We’ll take a look!
Use the code below. I was facing the same issue, but using this custom_config object I was able to get 100 results or more.
from colbert import Searcher
from colbert.infra import ColBERTConfig, Run, RunConfig

custom_config = ColBERTConfig(ncells=1000, ndocs=1000, reranker=True)
with Run().context(RunConfig(experiment='notebook')):
    searcher = Searcher(index=index_name, config=custom_config)
Hey Ravi Kumar,
Thanks for your comment.
Can you post the version of colbert-ai that you're using?
Also, when you say you were able to get 100+ results, do you mean 100+ results with the same passage ID?