Greg Tatum
I'm closing this experiment, but it may be worth playing around with decoder-depth and embeddings, as they produced the largest improvements in the results.
Edit: I rebased incorrectly. Re-running this.
Oh, and I don't have opinions on the rules themselves. Copying from another source seems reasonable, but I didn't think through how the rules apply to CJK.
Verify the fix with:

- https://bugzilla.mozilla.org/show_bug.cgi?id=1888970
- https://bugzilla.mozilla.org/show_bug.cgi?id=1888897
- https://bugzilla.mozilla.org/show_bug.cgi?id=1884577
- https://bugzilla.mozilla.org/show_bug.cgi?id=1881252
- https://bugzilla.mozilla.org/show_bug.cgi?id=1864472
- https://bugzilla.mozilla.org/show_bug.cgi?id=1862486
- https://bugzilla.mozilla.org/show_bug.cgi?id=1853300
- https://bugzilla.mozilla.org/show_bug.cgi?id=1882997
- https://bugzilla.mozilla.org/show_bug.cgi?id=1879019
- https://github.com/mozilla/translations/issues/699
> As we discussed before dataset modification should not be a default behavior.

I don't know that this statement is correct. I remember you bringing it up in the review,...
We also discussed in our 1:1 that I'd be fine making it configurable, and defaulting to not chunking sentences together.
I'm pretty concerned about this one, especially since OpusTrainer relies so heavily on Python's whitespace splitting. We might have to rely on a fork here if we want to use...
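
To make the whitespace concern concrete, here is a minimal sketch using only Python's built-in `str.split()`. It is not OpusTrainer's code; the helper name `whitespace_tokens` and the example sentences are made up for illustration. It only shows why splitting on whitespace gives word-like units for English but not for text written without spaces, such as Japanese.

```python
def whitespace_tokens(text: str) -> list[str]:
    """Split text the way Python's default str.split() does (on runs of whitespace)."""
    return text.split()


# Hypothetical example sentences, chosen only for illustration.
english = "The cat sat on the mat."
japanese = "猫がマットの上に座った。"

en_tokens = whitespace_tokens(english)
ja_tokens = whitespace_tokens(japanese)

# English splits into word-like units; the Japanese sentence contains
# no whitespace, so it comes back as a single "token".
print(len(en_tokens), en_tokens)
print(len(ja_tokens), ja_tokens)
```

Anything that operates per "word" after a split like this would treat the whole Japanese sentence as one unit.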
I created 3 separate issues for different lines of investigation.
I did some light analysis of our recent runs, comparing their distillation gap against the sentence counts. https://docs.google.com/spreadsheets/d/1l459Ui9J7ccdP6UMd1qDy51L8Uar2aZWbYWxOGcQqXA/edit?gid=1859623642#gid=1859623642

Data Source | Correlation
-- | --
All monolingual data | 0.331...
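
For anyone who wants to reproduce this kind of number outside the spreadsheet, here is a rough sketch of a Pearson correlation in plain Python. I'm assuming Pearson is the coefficient meant here. The function name `pearson` and the input lists are hypothetical placeholders, not values from our runs.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder inputs (NOT real data): one entry per training run.
sentence_counts = [10.0, 20.0, 30.0, 40.0]
distillation_gaps = [1.0, 3.0, 2.0, 4.0]

print(pearson(sentence_counts, distillation_gaps))
```

A value near 0 means little linear relationship, and values near 1 or -1 mean a strong one.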
#790 Here's another idea on applying fluency similar to HPLT to our translations.