learning_to_retrieve_reasoning_paths

Why are some document titles missing?

Open mukhal opened this issue 2 years ago • 2 comments

Thank you for the amazing repo.

I am curious why some titles are missing from the TF-IDF index. During evaluation we get multiple warnings such as:

Oranjegekte_0 is missing
James Gunn_0 is missing
..

I assume this means that some document titles are not found in the database. Is that normal? Could you explain?

Thanks!

mukhal avatar Oct 28 '21 22:10 mukhal

Hi, sorry for my late response! Could you share the command you are running and the dataset on which you see this issue? I think I have seen the same issue when a Wikipedia title (id) cannot be matched with any of the ids in the database. In particular,

  • the code does not handle some Unicode characters well
  • the Wikipedia entity title has been changed or redirected to a new one
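
To illustrate the Unicode case, here is a minimal sketch, assuming the mismatch comes from two strings that look identical but use different Unicode normalization forms. The helper `normalize_title`, the example title, and the `db_ids` set are hypothetical stand-ins, not part of this repo's code, and the choice of NFC is only an assumption about what the database might store.

```python
import unicodedata


def normalize_title(title: str) -> str:
    """Return an NFC-normalized copy of a title (hypothetical helper)."""
    return unicodedata.normalize("NFC", title)


# Two encodings of the same visible title:
# precomposed "é" versus "e" followed by a combining acute accent.
composed = "Beyonc\u00e9_0"
decomposed = "Beyonce\u0301_0"

# Hypothetical stand-in for the ids stored in the paragraph database.
db_ids = {normalize_title(composed)}

# A raw lookup fails even though both strings render identically.
raw_hit = decomposed in db_ids

# Normalizing the query to the same form as the stored ids makes it match.
normalized_hit = normalize_title(decomposed) in db_ids

print(raw_hit, normalized_hit)
```

Normalization of this kind would not help with the second case: a renamed or redirected title needs a mapping from the old title to the new one.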

AkariAsai avatar Mar 26 '22 21:03 AkariAsai

Thanks for the response. This happens with HotpotQA when I run the following command or similar commands.

python run_graph_retriever.py \
        --task hotpot_open \
        --bert_model bert-base-uncased --do_lower_case \
        --dev_file_path path/to/hotpotqa/dev \
        --output_dir path/to/output \
        --model_suffix 3 \
        --max_para_num 10 \
        --tfidf_limit 50 \
        --beam 4 \
        --eval_chunk 200 \
        --eval_batch_size 64 \
        --split_chunk 1000 \
        --pruning_by_links \
        --example_limit 128

I think the main issue is that some titles are retrieved by the TF-IDF retriever, but when their content is then fetched with tfidf_retriever.load_abstract_para_text(), the warning is printed for some documents. I am not sure whether I should worry about it, though, since I was able to reproduce your results even with the warning appearing many times.
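
One way to judge how much this matters is to measure what fraction of retrieved ids have no entry in the database. The sketch below is only an illustration: `split_found_missing` is a hypothetical helper, and the plain dict `db` stands in for the real paragraph database, whose actual interface is not shown here.

```python
from typing import Dict, Iterable, List, Tuple


def split_found_missing(
    retrieved_ids: Iterable[str], db: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Partition retrieved ids into those present in db and those absent."""
    found: List[str] = []
    missing: List[str] = []
    for doc_id in retrieved_ids:
        if doc_id in db:
            found.append(doc_id)
        else:
            missing.append(doc_id)
    return found, missing


# Hypothetical database contents and retrieval output.
db = {"Some Title_0": "paragraph text"}
retrieved = ["Some Title_0", "Oranjegekte_0", "James Gunn_0"]

found, missing = split_found_missing(retrieved, db)
missing_rate = len(missing) / len(retrieved)

for doc_id in missing:
    print(f"{doc_id} is missing")
print(f"missing rate: {missing_rate:.2%}")
```

A low missing rate over the full dev set would be consistent with the results still being reproducible.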

mukhal avatar Mar 27 '22 01:03 mukhal