QuAC Dataset
When will you support evaluation on the QuAC dataset? I found the results of the Llama 2 paper difficult to reproduce, especially regarding how to segment the answer when computing the base model's F1 score.
Are there any solutions? I am confused.
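For reference, QuAC scores answers with the same token-overlap F1 used by SQuAD: both prediction and reference are normalized (lowercased, punctuation and articles stripped), whitespace-tokenized, and F1 is computed over the shared tokens; QuAC then takes the max over the reference answers for each question. A minimal sketch of that metric (the function names here are my own, not from any particular implementation):

```python
import re
import string
from collections import Counter

def normalize_answer(s: str) -> str:
    """SQuAD/QuAC-style normalization: lowercase, drop punctuation
    and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between one prediction and one reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def best_f1(prediction: str, references: list[str]) -> float:
    """QuAC reports the max F1 over all reference answers."""
    return max(token_f1(prediction, ref) for ref in references)
```

The tricky part the base-model question alludes to is segmentation: a base model keeps generating after the answer, so you have to cut the output (e.g. at the first newline) before scoring, and where you cut changes the F1 noticeably.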
If nobody is already working on it, I can try this feature.
That would be great!
Hi @glerzing, are you working on QuAC? If not, I can take it.
Yes, I'm on it. It's a bit more complicated than I expected and will probably take weeks.
Actually, I will not have the opportunity to finish it, sorry. I had other important things to do. The implementation should be quite similar to the CoQA one, but to avoid the same problem as in #1231, you need a way to make a list of predictions for each document, probably by implementing construct_requests. @Sanchit-404, if you are still motivated, feel free to pick up this issue.
Hi, any updates on this?
@glerzing, if you have a partial implementation or a high-level plan, please share it; it will be helpful for anyone picking this up.
The Python script is too much of a draft to share. It's not worth much, but here is the README.md. I would have liked to have a YAML file for this, with the ability to redefine construct_requests inside the YAML file, like with doc_to_text or process_results. That would be cleaner than having to implement it with a class that inherits from Task, as in squadv2.
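Sketching what such a YAML task config might look like, by analogy with the existing doc_to_text / process_results hooks: a `construct_requests: !function ...` key would point at a Python function in the task's utils module. Note that the `construct_requests` key shown here is the proposed extension, not something the harness currently supports; the other field names are only illustrative:

```yaml
# Hypothetical quac.yaml -- the construct_requests key does not exist yet.
task: quac
dataset_path: quac
output_type: generate_until
doc_to_text: !function utils.doc_to_text
process_results: !function utils.process_results     # receives a list of predictions per doc
construct_requests: !function utils.construct_requests  # proposed: one request per question
metric_list:
  - metric: f1
    aggregation: mean
    higher_is_better: true
```

This would keep the whole task declarative and avoid a one-off Task subclass just to change how requests are built.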