
Examples where >1 targets?

Open · Muennighoff opened this issue 3 years ago • 2 comments

Since you changed the signature of apply() to return a list for targets instead of a string, can you point me to some datasets that use multiple targets? Is randomly picking one the best way to get back to a single string?

Muennighoff · Aug 08 '22

Hey @Muennighoff! Unless I'm misremembering, this was only changed on the eval-hackathon branch. It was a feature requested by the eval team, and maybe @cjlovering or @jordiclive can point you to the relevant datasets. My understanding is that it's motivated by generation datasets that have multiple valid targets. We haven't fully decided whether the current API in eval-hackathon will be merged as-is into main. For example, we could keep the current API as is (but return just the first target when there are multiple) and add another method that returns all targets. Open to suggestions!
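
For illustration, here's a minimal sketch of that alternative. All names and the internal return shape are assumptions for the sake of the example, not the actual promptsource implementation:

```python
# Hypothetical sketch of the proposed API split; _render, its return shape,
# and apply_all_targets are illustrative assumptions, not promptsource code.
from typing import List, Tuple


class Template:
    def _render(self, example: dict) -> Tuple[str, List[str]]:
        # Placeholder for the existing Jinja rendering, which can yield
        # several valid targets for generation datasets.
        raise NotImplementedError

    def apply(self, example: dict) -> Tuple[str, str]:
        """Backwards-compatible: return the prompt and only the first target."""
        prompt, targets = self._render(example)
        return prompt, targets[0] if targets else ""

    def apply_all_targets(self, example: dict) -> Tuple[str, List[str]]:
        """New method: return the prompt and every valid target."""
        return self._render(example)
```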

stephenbach · Aug 09 '22

@Muennighoff, GEM/web_nlg and GEM/wiki_auto_asset_turk are examples of datasets with multiple references. For example, GEM/wiki_auto_asset_turk/test_asset has 10 references.
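
You can inspect this with the datasets library. This is just a sketch: the dataset id, split name, and the per-example `references` field follow the GEM schema as I recall it and may differ slightly:

```python
from datasets import load_dataset

# GEM datasets typically expose a "references" list per example alongside "target".
ds = load_dataset("GEM/wiki_auto_asset_turk", split="test_asset")
print(len(ds[0]["references"]))  # multiple human references per example
```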

Yes, the reasoning is that in NLG a single reference is often unreliable, so many test sets are designed with multiple references and should be scored with multi-reference metrics. Multi-reference metric support (for BLEU, ROUGE, SARI) was also added to the BigScience EH for this reason. Other NLG datasets intended for multi-reference evaluation include E2E and ToTTo, but I'm not sure whether they were implemented.
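
As an illustration of multi-reference scoring, here is a sketch using sacrebleu's corpus_score, which takes one reference list per reference "stream" (this is not the actual BigScience EH code, just an example of the idea):

```python
from sacrebleu.metrics import BLEU

hypotheses = ["the cat sat on the mat"]
# One list per reference stream; each stream holds one reference per hypothesis.
references = [
    ["a cat sat on the mat"],
    ["the cat was sitting on the mat"],
]

bleu = BLEU()
print(bleu.corpus_score(hypotheses, references).score)
```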

For promptsource main, I would suggest supporting multiple references for these datasets. Choosing the first reference, or even a random one, makes results incomparable.

jordiclive · Aug 10 '22