masakhane-mt
Create new notebooks that do not rely on JW300
Slack discussion: https://masakhane-nlp.slack.com/archives/C01JAP67HRV/p1634844082006400
https://github.com/joeynmt/joeynmt/blob/master/joey_demo.ipynb is the Tatoeba example.
One suggestion from the Slack discussion is to split the new notebook code into two parts:
- One notebook that takes in a HuggingFace dataset at the top and proceeds from there to train a JoeyNMT model. This might make things a lot easier for people: if they can get their data into the HuggingFace Dataset format, we can show them how to train.
- One notebook that shows people how to get their data into that format: it loads data from various file types or sources (.csv, paired text files, directly from the HuggingFace hub) into the HuggingFace Dataset format: https://huggingface.co/docs/datasets/loading_datasets.html
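As a rough illustration of the second notebook's job, here is a minimal sketch that reads a CSV file or a pair of line-aligned text files into translation-style records. It uses only the Python standard library so it runs without extra installs. The function names, the column names, and the `{"translation": {lang: text}}` record layout are assumptions made for this example (the layout mirrors what many translation datasets on the HuggingFace hub use), not something decided in this issue. In an actual notebook, the `datasets` library's own loaders, such as `load_dataset("csv", data_files=...)`, would likely replace the CSV part.

```python
import csv
from pathlib import Path


def records_from_paired_files(src_path, trg_path, src_lang, trg_lang):
    """Read two line-aligned text files into translation-style records.

    Hypothetical helper: line i of src_path is assumed to be the
    translation of line i of trg_path.
    """
    src_lines = Path(src_path).read_text(encoding="utf-8").splitlines()
    trg_lines = Path(trg_path).read_text(encoding="utf-8").splitlines()
    if len(src_lines) != len(trg_lines):
        raise ValueError("source and target files must have the same number of lines")
    return [
        {"translation": {src_lang: src, trg_lang: trg}}
        for src, trg in zip(src_lines, trg_lines)
    ]


def records_from_csv(csv_path, src_col, trg_col, src_lang, trg_lang):
    """Read a CSV with one sentence pair per row into the same record layout.

    Hypothetical helper: src_col and trg_col are the header names of the
    columns holding the source and target sentences.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [
            {"translation": {src_lang: row[src_col], trg_lang: row[trg_col]}}
            for row in csv.DictReader(f)
        ]
```

Both helpers return a plain list of dictionaries, so whichever source the data came from, the training notebook would only need to handle one shape.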
See this slack discussion: https://masakhane-nlp.slack.com/archives/C01GF5XJ0TF/p1634863777007500?thread_ts=1634844471.007300&cid=C01GF5XJ0TF
https://colab.research.google.com/drive/1RWOle7RHy_wq0uGWxmAq1ZfmEQIFsCHj#scrollTo=h1Ddy4_AOKdm could make for a starting point. I think this notebook shows how to download a HuggingFace dataset and write it out to files in the format JoeyNMT expects.
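For reference, the "write it out to files" step could look roughly like the sketch below. It is not taken from the linked notebook. It assumes each record carries a `"translation"` dictionary keyed by language code, and that JoeyNMT is pointed at plain-text parallel files sharing a prefix, with one file per language (for example `train.en` and `train.sw`). The function name and arguments are made up for illustration.

```python
from pathlib import Path


def write_parallel_files(records, out_dir, split, src_lang, trg_lang):
    """Write sentence pairs to two line-aligned files: <split>.<src_lang> and <split>.<trg_lang>.

    Hypothetical helper: `records` is any iterable of dicts shaped like
    {"translation": {src_lang: "...", trg_lang: "..."}}.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    src_path = out_dir / f"{split}.{src_lang}"
    trg_path = out_dir / f"{split}.{trg_lang}"
    with open(src_path, "w", encoding="utf-8") as src_f, \
            open(trg_path, "w", encoding="utf-8") as trg_f:
        for record in records:
            pair = record["translation"]
            # Collapse all whitespace so an embedded newline cannot
            # break the line-by-line alignment of the two files.
            src = " ".join(pair[src_lang].split())
            trg = " ".join(pair[trg_lang].split())
            # Skip pairs where either side is empty.
            if not src or not trg:
                continue
            src_f.write(src + "\n")
            trg_f.write(trg + "\n")
    return src_path, trg_path
```

Calling it once per split (train, dev, test) would give the three file pairs a JoeyNMT config typically refers to. The exact config keys should be checked against the JoeyNMT version in use.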
@cdleong if this is still relevant, I would like to work on it.
I think it is still relevant, yes. I also just finished my semester, so I might have more free time after the holidays.
On Mon, Dec 12, 2022, 1:43 PM Benjamin Beilharz wrote: @cdleong if this is still relevant, I would like to work on it.
Alright, so I have started with the notebook and will be done by the end of next week. I have to prepare for an exam next Wednesday, but I will be wrapping up the notebook.
/self-assign
Any update?