evaluation
Add HANS dataset
- Evaluated on GPT2
- Time taken: 3:40:59 (h:mm:ss) on a GTX 1080 Ti
Other comments:
- The prompt template is the same as the one used for XQUAD/PIAF, with the minor addition of the question "is this true or false?" (to indicate entailment/non-entailment).
- In addition to accuracy, the other fine-grained evaluation metrics from the HANS evaluation script (https://github.com/tommccoy1/hans/blob/master/evaluate_heur_output.py) are also reported; they can be removed if deemed unnecessary.
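
For reviewers unfamiliar with the fine-grained breakdown, here is a minimal sketch of the idea, not the code in this PR: accuracy is reported overall and also per heuristic and gold label. The function name `fine_grained_accuracy`, the `ANSWER_TO_LABEL` mapping, and the choice to count unparseable answers as wrong are assumptions made for illustration. The field names `heuristic` and `gold_label` follow the columns of the HANS data files.

```python
from collections import defaultdict

# Assumption: the model answers the "is this true or false?" question with
# "true" or "false", which we map onto the two HANS labels.
ANSWER_TO_LABEL = {"true": "entailment", "false": "non-entailment"}


def fine_grained_accuracy(examples, answers):
    """Return overall accuracy plus accuracy per (heuristic, gold label).

    examples: list of dicts with "heuristic" and "gold_label" keys.
    answers:  list of raw model answers, aligned with `examples`.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for example, answer in zip(examples, answers):
        # Answers that are neither "true" nor "false" map to None and
        # therefore never match the gold label (hypothetical policy).
        predicted = ANSWER_TO_LABEL.get(answer.strip().lower())
        is_correct = predicted == example["gold_label"]
        bucket = f'{example["heuristic"]}/{example["gold_label"]}'
        for key in ("overall", bucket):
            total[key] += 1
            correct[key] += int(is_correct)
    return {key: correct[key] / total[key] for key in total}


if __name__ == "__main__":
    demo_examples = [
        {"heuristic": "lexical_overlap", "gold_label": "entailment"},
        {"heuristic": "lexical_overlap", "gold_label": "non-entailment"},
        {"heuristic": "subsequence", "gold_label": "non-entailment"},
    ]
    demo_answers = ["true", "true", "false"]
    for name, value in fine_grained_accuracy(demo_examples, demo_answers).items():
        print(f"{name}: {value:.2f}")
```

The per-bucket numbers matter for HANS because a model that relies on the heuristics tends to answer "entailment" everywhere, which looks fine on the entailment buckets and poor on the non-entailment ones.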