How are the jsonl files in eval_data built? Thank you very much!
I would also be very curious to hear how this file can be reproduced. I'm assuming that https://github.com/AkariAsai/self-rag/tree/main/data_creation/generator summarizes the process, but the README seems a bit outdated and there is no reference to where the initial input file comes from. I would really appreciate a pointer to a place that has more guidance on this!
@Gera001 Hi, could you tell me which eval task's data you would like the creation details for? @jonhue Thanks for the question! I guess you are interested in the training data creation? Our initial instruction-tuning data comes from processed open-instruct data, plus KILT and some other knowledge-intensive task data that we processed manually. I can upload the source data without any Self-RAG processing if it helps! I apologize that the README for the training data creation is outdated... The training and evaluation were mostly done by me, I am the only one who maintains this repository, and I've been hectic with teaching duties and other projects as a Ph.D. student... I hope I can make more time to clean up and nicely package the codebase before ICLR. Sorry for the inconvenience!
Thank you for your answer @AkariAsai! Totally understand the time constraints, and really appreciate that you try to make this project easy to reproduce! I myself am interested in trying out some different retrievers, so if you are able to share the source data before running the retriever to retrieve similar passages, it would be very helpful! Thank you :)
@AkariAsai Hi, I would like to know how to reproduce the contents of eval_data/popqa_longtail.jsonl. Using your retriever, I can't reproduce the same answers with the top 20 docs. All the passages (the tsv file) and the embedding files are the same, but I can't reproduce the same results. In your eval_data, for example popqa_longtail.jsonl, the top 1 contains the correct article for about 0.49035025017869904 of the questions and the top 20 for about 0.6426018584703359, but with the embeddings and tsv file you provide I only get 0.23802716225875625 and 0.3173695496783417. This gap seems unreasonable, and I want to understand why it happens.
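For anyone comparing numbers like the ones above, here is a minimal sketch of one way to measure a top-k hit rate over retrieval results. It is not the script used in this repository or by the commenter. It assumes each record has an `answers` list and a `ctxs` list of passages with `title` and `text` fields (an assumption about the file layout, so please check it against your copy), and it counts a question as a hit when any of the top k passages contains one of the answer strings, which may differ from how "correct article" was judged above. The toy records are made up for illustration.

```python
import json
from typing import Dict, Iterable, List


def load_jsonl(path: str) -> List[Dict]:
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def has_answer(passage: Dict, answers: Iterable[str]) -> bool:
    """Case-insensitive substring match against title + text (assumed field names)."""
    text = (passage.get("title", "") + " " + passage.get("text", "")).lower()
    return any(a.lower() in text for a in answers if a)


def hit_rate_at_k(records: List[Dict], k: int) -> float:
    """Fraction of records whose top-k passages contain at least one answer string."""
    if not records:
        return 0.0
    hits = sum(
        any(has_answer(p, r["answers"]) for p in r["ctxs"][:k])
        for r in records
    )
    return hits / len(records)


# Toy data for illustration only, not taken from eval_data.
toy_records = [
    {"answers": ["Paris"],
     "ctxs": [{"title": "France", "text": "Paris is the capital."},
              {"title": "Germany", "text": "Berlin is the capital."}]},
    {"answers": ["Rome"],
     "ctxs": [{"title": "Spain", "text": "Madrid is the capital."},
              {"title": "Italy", "text": "Rome is the capital."}]},
]
print(hit_rate_at_k(toy_records, 1))  # 0.5
print(hit_rate_at_k(toy_records, 2))  # 1.0
```

To run it on a real file, something like `hit_rate_at_k(load_jsonl("eval_data/popqa_longtail.jsonl"), 20)` should work if the field names match. Running the same function on both the released file and your own retrieval output at least rules out a difference in the metric itself.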
I would also like to know about popqa_longtail_w_gs.jsonl: what kind of manual processing did you do to create it? @AkariAsai could you share the details? Thanks.