open-lid-dataset Add languages of the AmericasNLP MT Shared Task

Open onadegibert opened this issue 1 year ago • 2 comments

The AmericasNLP Shared Task on Machine Translation into Indigenous Languages covers 11 Indigenous languages of the Americas: Aymara, Bribri, Asháninka, Chatino, Guarani, Wixarika, Nahuatl, Otomí, Quechua, Shipibo-Konibo, and Rarámuri. You can find data for these languages in this year's GitHub repository.

Would it be feasible to add them to the open-lid-dataset? I would be more than happy to help make this possible!

Thanks :)

Mar 04 '24 09:03 onadegibert

Hello! I'm excited to hear about the task and the data available, I hope it goes well!

Currently all the languages in OpenLID are included in the FLORES+ dataset so we have a level of common evaluation. Is there any scope to translate FLORES+ into the languages you cover in your shared task? Please see OLDI for more details.

In any case, I plan to add more languages in a batch at some point this year, so thank you for letting me know about this data!

Mar 04 '24 17:03 laurieburchell

Hi Laurie, sorry for taking so much time to answer.

Currently we do not have FLORES+ translations but we finally discussed it and this may be something we could do next year. We will definitely let you know!

Apr 25 '24 07:04 onadegibert
