
Fix formatting issues in XNLI tasks in Basque, Catalan, Galician and Spanish

Open juliafalcao opened this issue 7 months ago • 1 comments

This PR corrects a number of formatting issues that we have identified with the construction of prompts for the following XNLI tasks: xnli_eu, xnli_ca, xnli_gl and xnli_es_spanish_bench.

TO-DO

  • The XNLI datasets contain several documents that are not fit to be concatenated into the prompt style of these tasks, where they are placed in the middle of a sentence such as {premise}, ¿correcto? or No, {hypothesis}. Only a single, affirmative sentence makes sense in these slots; some instances contain questions, sentences with "..." in the middle, or sentences with exclamation points, which lead to grammatically incorrect (and unintelligible) prompts. We include some examples below. The solution we propose is to filter out these documents before preprocessing: we add a filter function that removes instances where the premise or the hypothesis contains punctuation in the middle.
  • We remove the comma after "Así que" in xnli_es, because it is grammatically incorrect in Spanish.
  • We fix the preprocessing functions to ensure that: (1) all sentences start with an uppercase letter; (2) all hypotheses (which are concatenated into the middle of a full sentence) start with a lowercase letter; (3) all sentences end with a single period.
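
The filtering and normalization steps listed above could look roughly like the following. This is a minimal sketch, not the code in this PR: the function names, the exact set of disallowed characters, and the choice to strip the premise's trailing period (because it is followed by a comma in the prompt) are all assumptions made for illustration. Only the `premise` and `hypothesis` field names come from the description above.

```python
import re

# Assumption: a document is unfit if it contains question marks, exclamation
# marks, or an ellipsis (two or more consecutive periods) anywhere in the text.
_DISALLOWED = re.compile(r"[?¿!¡…]|\.{2,}")


def is_single_affirmative(text: str) -> bool:
    """Return True if the text contains none of the disallowed punctuation."""
    return _DISALLOWED.search(text) is None


def keep_doc(doc: dict) -> bool:
    """Hypothetical filter: keep a document only if both fields are usable."""
    return is_single_affirmative(doc["premise"]) and is_single_affirmative(
        doc["hypothesis"]
    )


def normalize_premise(text: str) -> str:
    """Uppercase the first letter and drop trailing periods.

    Assumption: the premise is followed by ", ¿correcto?" (or equivalent),
    so it should not carry its own final period.
    """
    text = text.strip().rstrip(".").strip()
    return text[:1].upper() + text[1:]


def normalize_hypothesis(text: str) -> str:
    """Lowercase the first letter and end with exactly one period."""
    text = text.strip().rstrip(".").strip()
    return text[:1].lower() + text[1:] + "."
```

If the task's preprocessing hook receives a Hugging Face `datasets.Dataset`, a predicate like `keep_doc` could be passed to `Dataset.filter` before the normalization is applied. Note that blindly lowercasing the first letter of a hypothesis also lowercases proper nouns, which a real implementation may want to handle.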

Examples of misformatted prompts

xnli_es_spanish_bench:

  • ... jirones de nubes blancas esparcidas por todo un claro cielo azul, ¿correcto? Así que, el sol está detrás de una nube esponjosa con forma de un conejito.
  • Ese fue..., ese fue un día bastante aterrador, ¿correcto? No, fue un día relajante.
  • ¿Qué te gusta más, las matemáticas o las ciencias, ¿correcto? Sí, ¿Prefiere las matemáticas o las ciencias?.

xnli_gl:

  • Por exemplo, un presidente do programa preparou a man algúns comentarios introdutorios laudatorios sobre un.., verdadeiro? Si, un presidente do programa preparou algunhas observacións introdutorias.
  • si, escóitoo, verdadeiro? Ademais, creo que o oio.
  • Debería cambiar a Linux, verdadeiro? Si, deberías cambiar o teu sistema operativo a Linux?.

xnli_ca:

  • bé per què no comences perquè has tingut més temps per pensar-hi si no t'importa, correcte? Sí, per què no ho fas tu primer?.
  • Aquest home va néixer a Alemanya, és ric, ben educat, ha viatjat.., correcte? No, aquest home va néixer a Arkansas i era pobre, no tenia educació i no va viatjar mai.
  • ... però la segona vegada que té lloc la trobada es veu atrapat al mig entre dos amics, correcte? No, només té una trobada, en què participa un amic seu.

xnli_eu:

  • Ez, neskak egia perfektuarekin erantzun zuen., ezta? Bai, Zintzoa zen, eta ezetz esan zuen.
  • 139 \"Eta, hala ere, absolbitua izan daitekeela diozu?\", ezta? Gainera, Banpiroen jaunekin lan egiteko epaiketan dago.
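
To make the failure mode concrete, the sketch below rebuilds one of the Spanish examples above by plain string concatenation. The template and the connector list are inferred from the examples in this description (including the comma after "Así que" that this PR removes); they are assumptions, not a copy of the task configuration, and `build_prompts` is a hypothetical helper.

```python
# Assumed pre-fix connectors for xnli_es_spanish_bench, in the usual XNLI
# label order: entailment, neutral, contradiction.
CONNECTORS = ("Sí", "Así que", "No")


def build_prompts(premise: str, hypothesis: str) -> list[str]:
    """Concatenate premise and hypothesis into one candidate prompt per label."""
    return [f"{premise}, ¿correcto? {c}, {hypothesis}" for c in CONNECTORS]


# A premise and hypothesis that are both questions produce an unreadable prompt.
prompts = build_prompts(
    "¿Qué te gusta más, las matemáticas o las ciencias",
    "¿Prefiere las matemáticas o las ciencias?.",
)
print(prompts[0])
```

Because nothing in the concatenation checks what kind of sentence fills each slot, questions and fragments pass straight through, which is why the filter has to run before this step.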

juliafalcao, Apr 02 '25 13:04

CLA assistant check
All committers have signed the CLA.

CLAassistant, Apr 02 '25 13:04