learning_to_retrieve_reasoning_paths
Why are some document titles missing?
Thank you for the amazing repo.
I am curious why some titles are missing from the tfidf index. During evaluation we get multiple warnings such as:
Oranjegekte_0 is missing
James Gunn_0 is missing
..
I assume this means that some document titles are not found in the database. Is that normal? Could you explain?
Thanks!
Hi, sorry for my late response! Could you share the command you are running and the dataset on which you see the issue?
I think I have seen the same issue when the Wikipedia title (id) cannot be matched with any of the ids in the database. In particular,
- the code does not handle some Unicode characters well
- the Wikipedia entity titles have been changed or redirected to a new one
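To illustrate the Unicode case, here is a minimal sketch of how two visually identical ids can fail a raw string comparison and still match after normalization. It is not the repository's code: `normalize_title`, `lookup`, the dictionary `db`, and the example title are hypothetical stand-ins, and the choice of NFD is an assumption made for illustration.

```python
import unicodedata


def normalize_title(title: str) -> str:
    """Return the title in NFD form so equivalent code point sequences compare equal."""
    return unicodedata.normalize("NFD", title)


# Two spellings of the same hypothetical id, "<title>_<paragraph index>".
composed = "Beyonc\u00e9_0"     # "é" stored as a single code point
decomposed = "Beyonce\u0301_0"  # "e" followed by a combining acute accent

# Hypothetical in-memory stand-in for the paragraph database.
db = {normalize_title(composed): "first paragraph text"}


def lookup(doc_id: str, db: dict):
    """Return the paragraph text for doc_id, or None when the id is absent."""
    return db.get(normalize_title(doc_id))


# The raw strings differ, which is how an "is missing" style warning can arise.
raw_strings_match = composed == decomposed

# After normalization both spellings resolve to the same entry.
text_from_decomposed = lookup(decomposed, db)

# An id that is genuinely absent (for example a renamed page) still returns None.
text_from_absent = lookup("Oranjegekte_0", db)
```

Normalization only helps with the first cause. A title that has been renamed or redirected would still need a separate mapping from the old title to the new one.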
Thanks for the response. This happens with HotpotQA when I run the following command or similar commands.
python run_graph_retriever.py \
--task hotpot_open \
--bert_model bert-base-uncased --do_lower_case \
--dev_file_path path/to/hotpotqa/dev \
--output_dir path/to/output \
--model_suffix 3 \
--max_para_num 10 \
--tfidf_limit 50 \
--beam 4 \
--eval_chunk 200 \
--eval_batch_size 64 \
--split_chunk 1000 \
--pruning_by_links \
--example_limit 128
I think the main issue is that some titles are returned by the tfidf retriever, but when their content is fetched with tfidf_retriever.load_abstract_para_text(), this warning is emitted for some documents. I am not sure whether I should worry about it, though, since I was able to reproduce your results even with the warning appearing many times.
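One way to judge whether the warnings matter is to count how many retrieved ids fail to resolve. The sketch below is only an illustration under stated assumptions: `collect_paragraphs` is a hypothetical helper, `load_fn` stands in for whatever function fetches paragraph text (its real signature in the repository is not assumed here), and `toy_db` is made-up placeholder data.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def collect_paragraphs(
    doc_ids: Iterable[str],
    load_fn: Callable[[str], Optional[str]],
) -> Tuple[Dict[str, str], List[str]]:
    """Split ids into those whose text was found and those that are missing.

    load_fn is assumed to return None (or an empty string) for an unknown id.
    """
    found: Dict[str, str] = {}
    missing: List[str] = []
    for doc_id in doc_ids:
        text = load_fn(doc_id)
        if text:
            found[doc_id] = text
        else:
            missing.append(doc_id)
    return found, missing


# Placeholder database with a single known id (hypothetical data).
toy_db = {"Some Title_0": "some paragraph text"}
retrieved_ids = ["Some Title_0", "Oranjegekte_0", "James Gunn_0"]

found, missing = collect_paragraphs(retrieved_ids, toy_db.get)

# Fraction of retrieved ids that could not be resolved.
missing_rate = len(missing) / len(retrieved_ids)
```

If the missing rate on the real data is small, the effect on the final scores is likely limited, which would be consistent with the results being reproducible despite the warnings.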