Jaume Zaragoza
I found this [corpus](https://huggingface.co/datasets/Nart/abkhaz_text), which comes from the Abkhazian National Corpus and Common Voice, so it probably has no language pollution and can be used for training. I [asked](https://huggingface.co/datasets/Nart/abkhaz_text/discussions/3) just...
### Bug description

`lr-decay-strategy epoch+stalled` does not decay the learning rate after stalled validation.

### How to reproduce

Set `--lr-decay 0.5 --lr-decay-strategy epoch+stalled --lr-decay-start 1 1` and wait until one...
While training Thai, I noticed we got only 3M sentences for back-translation, and something similar has happened to me before with other languages. So I decided to suggest this change and avoid a...
When looking at individual sentence scores, it's difficult to tell whether the translation is correct without the source, or to guess what the source of the error is. Showing the reference...
If at some point we want this to be supported, I think it should be sketched out a little. I also had some ideas that I don't want to forget....
Especially when running complicated language pairs that may not be well supported and that suffer a lot from filtering ([like](https://github.com/mozilla/translations/pull/1288#issuecomment-3559677897) Chinese Traditional), we need a detailed description of how much data...
- [ ] Update monolingual cleaning to use the newest LID tools.
- [ ] Short report of 100 languages.
- [ ] Choose LID tool based on target language.
(continuation of #894)

If we want to continue experimenting with student parameters, there are still combinations that we could try.

- [ ] More combinations of the parameters that Greg...
FP16 training can increase throughput a lot and may not hurt quality. I'm testing it.
As we discussed yesterday, it's possible that `- ` dashes at the beginning of sentences in OpenSubtitles are inconsistent and [may cause extra tokens in the output](https://github.com/mozilla/translations/issues/215#issuecomment-3327405949). We discussed that...