evaluation issues

Add ANLI dataset

My attempt to add the ANLI dataset (issue #32), including:
- Load ANLI and reformat each of the three validation splits (R1, R2, R3) into the prompt provided by the...

omerant

Add HANS dataset

1. Evaluated on GPT2
2. Time taken: 3:40:59 on GTX 1080 Ti
Other comments:
1. Prompt template used is the same as XQUAD/PIAF, with minor addition of the question "is...

aakanksha19

Add GEM Wikilingua to Full Benchmark

1

all 18 languages

epavlick

NLG

Add WMT to Full Benchmark

2

epavlick

MT

Refactor task template to merge multilingual.json and english.json

(per question raised about [slide 6](https://docs.google.com/presentation/d/1LLWFR5AElafxDK4zu4pFdw8-Rz-UGvemG6xcu2uICjE/edit?usp=sharing) at the evaluation meeting on 9/1).

marinecarpuat

feat: use promptsource templates

A simple proposal to use promptsource directly, so that we don't have to implement prompt templates from scratch.

tianjianjiang

Wrap evaluation benchmark using HF-trainer

2

This might sound like a bit of restructuring, but for the sake of future compatibility, I propose the following: 1. Move to `huggingface` trainer: This will help the repo to...

sbmaruf