lm-evaluation-harness Add GEM/ToTTo

Apr 29 '22 05:04 manandey

Can you note what models you were able to run this with?

I tried to run this off your fork, but I wasn't able to load any prompts. Are they merged into the eval_hackathon branch on PS?

Apr 30 '22 18:04 cjlovering

> Can you note what models you were able to run this with?
>
> I tried to run this off your fork, but I wasn't able to load any prompts. Are they merged into the eval_hackathon branch on PS?

Hi @cjlovering, I tried running it on GPT-2. No, the prompts are not yet merged into the eval_hackathon branch on PS.

For `python -m main --device cpu --tasks gem_totto --num_fewshot 0 --limit 5 --model gpt2`, the results were:

gpt2 (), limit: 5, provide_description: False, num_fewshot: 0, batch_size: None
|  Task   |          Prompt           |Version|      Metric       | Value |   |Stderr|
|---------|---------------------------|------:|-------------------|------:|---|-----:|
|gem_totto|final_text_describing_table|      0|bleu               |13.7120|±  |7.6106|
|gem_totto|final_text_describing_table|       |rouge1_precision   | 0.1914|±  |0.0706|
|gem_totto|final_text_describing_table|       |rouge1_recall      | 0.5547|±  |0.1688|
|gem_totto|final_text_describing_table|       |rouge1_fmeasure    | 0.2827|±  |0.0991|
|gem_totto|final_text_describing_table|       |rouge2_precision   | 0.1455|±  |0.0721|
|gem_totto|final_text_describing_table|       |rouge2_recall      | 0.4259|±  |0.1743|
|gem_totto|final_text_describing_table|       |rouge2_fmeasure    | 0.2151|±  |0.1019|
|gem_totto|final_text_describing_table|       |rougeL_precision   | 0.1825|±  |0.0695|
|gem_totto|final_text_describing_table|       |rougeL_recall      | 0.5261|±  |0.1613|
|gem_totto|final_text_describing_table|       |rougeL_fmeasure    | 0.2691|±  |0.0969|
|gem_totto|final_text_describing_table|       |rougeLsum_precision| 0.1914|±  |0.0706|
|gem_totto|final_text_describing_table|       |rougeLsum_recall   | 0.5547|±  |0.1688|
|gem_totto|final_text_describing_table|       |rougeLsum_fmeasure | 0.2827|±  |0.0991|
|gem_totto|guess the table page title |      0|bleu               | 0.2054|±  |0.0348|
|gem_totto|guess the table page title |       |rouge1_precision   | 0.0097|±  |0.0059|
|gem_totto|guess the table page title |       |rouge1_recall      | 0.1667|±  |0.1054|
|gem_totto|guess the table page title |       |rouge1_fmeasure    | 0.0182|±  |0.0112|
|gem_totto|guess the table page title |       |rouge2_precision   | 0.0000|±  |0.0000|
|gem_totto|guess the table page title |       |rouge2_recall      | 0.0000|±  |0.0000|
|gem_totto|guess the table page title |       |rouge2_fmeasure    | 0.0000|±  |0.0000|
|gem_totto|guess the table page title |       |rougeL_precision   | 0.0097|±  |0.0059|
|gem_totto|guess the table page title |       |rougeL_recall      | 0.1667|±  |0.1054|
|gem_totto|guess the table page title |       |rougeL_fmeasure    | 0.0182|±  |0.0112|
|gem_totto|guess the table page title |       |rougeLsum_precision| 0.0097|±  |0.0059|
|gem_totto|guess the table page title |       |rougeLsum_recall   | 0.1667|±  |0.1054|
|gem_totto|guess the table page title |       |rougeLsum_fmeasure | 0.0182|±  |0.0112|
|gem_totto|guess the table webpage url|      0|bleu               | 1.8425|±  |0.7780|
|gem_totto|guess the table webpage url|       |rouge1_precision   | 0.0341|±  |0.0122|
|gem_totto|guess the table webpage url|       |rouge1_recall      | 0.1893|±  |0.0648|
|gem_totto|guess the table webpage url|       |rouge1_fmeasure    | 0.0577|±  |0.0206|
|gem_totto|guess the table webpage url|       |rouge2_precision   | 0.0148|±  |0.0099|
|gem_totto|guess the table webpage url|       |rouge2_recall      | 0.0905|±  |0.0585|
|gem_totto|guess the table webpage url|       |rouge2_fmeasure    | 0.0254|±  |0.0170|
|gem_totto|guess the table webpage url|       |rougeL_precision   | 0.0341|±  |0.0122|
|gem_totto|guess the table webpage url|       |rougeL_recall      | 0.1893|±  |0.0648|
|gem_totto|guess the table webpage url|       |rougeL_fmeasure    | 0.0577|±  |0.0206|
|gem_totto|guess the table webpage url|       |rougeLsum_precision| 0.0341|±  |0.0122|
|gem_totto|guess the table webpage url|       |rougeLsum_recall   | 0.1893|±  |0.0648|
|gem_totto|guess the table webpage url|       |rougeLsum_fmeasure | 0.0577|±  |0.0206|

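A note on reading the Stderr column: with `--limit 5`, each metric is averaged over only five examples, so wide error bars are expected. Below is a minimal sketch of that effect, assuming the reported Stderr is the standard error of the mean over per-example scores (an assumption; the harness may bootstrap some metrics such as BLEU instead). The helper name `mean_and_stderr` and the scores are hypothetical and are not taken from the run above.

```python
import math
import statistics


def mean_and_stderr(scores):
    """Return (mean, standard error of the mean) for per-example scores."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate a standard error")
    mean = statistics.fmean(scores)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    stderr = statistics.stdev(scores) / math.sqrt(n)
    return mean, stderr


# Hypothetical per-example ROUGE-1 F-measures, NOT from the table above.
example_scores = [0.10, 0.20, 0.30, 0.40, 0.50]
mean, stderr = mean_and_stderr(example_scores)
print(f"mean={mean:.4f} stderr={stderr:.4f}")
```

Under this assumption the standard error shrinks roughly as 1/sqrt(n), so a larger `--limit` should tighten the intervals.
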
Thanks to @jon-tow for helping me fix the issues I was facing.

May 02 '22 17:05 manandey

Great, it's looking good! Let's wait on merging until the prompts get pulled in on the PS side.

May 02 '22 18:05 cjlovering

@manandey How is this going? Is GEM/ToTTo merged into promptsource?

May 11 '22 15:05 cjlovering

Hi @cjlovering, the promptsource PR for this was opened a long time ago, but the review process has been slow. Further changes to the templates have now been suggested. I hope the PR gets merged soon, and I will keep you posted.

May 11 '22 15:05 manandey