masakhane-mt
Update notebooks to no longer rely on JW300
Edit: see #200. Maybe we should leave the old JW300 notebooks up and create new ones instead.
The problem
JW300 has been taken down for copyright reasons. At least the following notebooks all rely on it:
- https://github.com/masakhane-io/masakhane-mt/blob/master/starter_notebook_from_English_training.ipynb
- https://github.com/masakhane-io/masakhane-mt/blob/master/starter_notebook_gdrive_from_English.ipynb
- https://github.com/masakhane-io/masakhane-mt/blob/master/starter_notebook_into_English_training.ipynb
A solution (but see #200)
The notebooks need to be updated so that they no longer use this dataset. Perhaps we could use Tatoeba or FLORES-101, or one of the other machine translation datasets listed at https://huggingface.co/datasets?task_ids=task_ids:machine-translation&sort=downloads
Steps that need to be done:
- [ ] (optional) assign yourself in "Assignees" over to the right
- [ ] Try running the notebooks in Google Colab.
- [ ] See where they break.
- [ ] Edit the notebook to swap in another dataset, perhaps by loading a HuggingFace dataset and then writing it back out in a format JoeyNMT knows how to use (for example, creating a train.en and a train.xh file).
- [ ] Fork the masakhane-mt repo (see https://docs.github.com/en/get-started/quickstart/fork-a-repo).
- [ ] Swap in your updated notebook
- [ ] Make a merge request/pull request so that everyone can use the updated notebook.
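As a rough sketch of the "write it back out" step above: the helper below takes any iterable of (source, target) sentence pairs and writes them as two line-aligned plain-text files, such as train.en and train.xh. The function name `write_parallel_files`, its parameters, and the toy sentence pairs are made up for illustration. It also assumes JoeyNMT wants one sentence per line with matching line numbers across the two files, so check that against the JoeyNMT version the notebook installs.

```python
from pathlib import Path
from typing import Iterable, Tuple


def write_parallel_files(
    pairs: Iterable[Tuple[str, str]],
    out_dir: str,
    src_lang: str = "en",
    tgt_lang: str = "xh",
    split: str = "train",
) -> int:
    """Write sentence pairs to <split>.<src_lang> and <split>.<tgt_lang>.

    Hypothetical helper, not part of JoeyNMT or the notebooks.
    Returns the number of pairs written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    src_path = out / f"{split}.{src_lang}"
    tgt_path = out / f"{split}.{tgt_lang}"

    written = 0
    with src_path.open("w", encoding="utf-8") as src_file, \
            tgt_path.open("w", encoding="utf-8") as tgt_file:
        for src, tgt in pairs:
            # Collapse internal whitespace and newlines so that each
            # sentence occupies exactly one line in its file.
            src = " ".join(src.split())
            tgt = " ".join(tgt.split())
            # Skip pairs with an empty side to keep the files aligned.
            if not src or not tgt:
                continue
            src_file.write(src + "\n")
            tgt_file.write(tgt + "\n")
            written += 1
    return written


if __name__ == "__main__":
    # Toy pairs for illustration only; real data would come from the
    # replacement dataset chosen for the notebook.
    toy_pairs = [("Hello", "Molo"), ("Thank you", "Enkosi")]
    print(write_parallel_files(toy_pairs, "data"))
```

In a notebook, the pairs could be built from the rows of whichever HuggingFace dataset gets swapped in. The field names holding the source and target text differ from dataset to dataset, so that mapping is left out here.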
So for example, this section breaks because JW300 is no longer downloadable: