firefox-translations-training
[Experiment] Inline noise Feb 2024
- produce space-tokenized alignments for the teacher and student corpora so they work with the Tags (inline noise) modifier
- jointly train alignments for the original and back-translated corpus to improve accuracy
- move shortlist production into a separate step
- add inline noise augmentation to dataset importer using unsupervised SimAlign lib based on BERT to get alignments there (fast_align won't work well on a small corpus)
- adjust augmentation settings
- add tests for alignment tasks using precompiled `fast_align` and `extract_lex` from toolchain
- add a script to download toolchain binaries locally
- add the ability to run tests locally under Docker
See the paper: SimAlign: High Quality Word Alignments without Parallel Training Data using Static and Contextualized Embeddings
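To illustrate why the alignments need to be space-tokenized, here is a minimal sketch of how an inline noise token could be placed in both sides of a sentence pair. It assumes alignments in the `i-j` (Pharaoh) format that fast_align emits, where the indices refer to whitespace-separated tokens. The function names and the `noise` argument are hypothetical and are not the actual Tags modifier implementation.

```python
def parse_alignment(line):
    """Parse a line of 'i-j' pairs (source index, target index)."""
    pairs = []
    for item in line.split():
        src, trg = item.split("-")
        pairs.append((int(src), int(trg)))
    return pairs


def insert_inline_noise(source, target, alignment, noise, src_index):
    """Insert `noise` after source token `src_index` and after the last
    target token aligned to it. Returns None if that source token has
    no aligned target token."""
    # Indices in the alignment are only meaningful for space-split tokens.
    src_tokens = source.split(" ")
    trg_tokens = target.split(" ")
    aligned = [t for s, t in parse_alignment(alignment) if s == src_index]
    if not aligned:
        return None
    trg_index = max(aligned)
    src_tokens.insert(src_index + 1, noise)
    trg_tokens.insert(trg_index + 1, noise)
    return " ".join(src_tokens), " ".join(trg_tokens)


example = insert_inline_noise(
    "das ist ein Haus", "this is a house", "0-0 1-1 2-2 3-3", "<b>", 1
)
```

If the corpus were tokenized differently from the text the modifier sees (for example with subword units), the indices would point at the wrong words, which is presumably why the alignment step has to agree on plain space tokenization.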
fixes #331 fixes #456
[skip ci]
It's ready for review, but I just launched the experiment that will take a while to complete: https://firefox-ci-tc.services.mozilla.com/tasks/groups/Uv6EgA9SQdGT54nkWFwACQ
I guess since I left a comment I'm not on the review anymore. I'll re-add myself and try to look at it tomorrow.
https://firefox-ci-tc.services.mozilla.com/tasks/Jmrh6CCOS_23BG-dNKUC9A
I see this ultimately failed because of what appears to be a bunch of spot terminations killing the `alignments-teacher` task. Do you want to switch this to run on a `-standard` instance?
@bhearsum It's weird that we got 5 terminations in a row. I don't think that was the case before. Yes, we can switch but we should investigate what changed.
Running with a 1 TB disk: https://firefox-ci-tc.services.mozilla.com/tasks/groups/DE-5HDfTSdKUsHXUUeUiOA
We investigated with @bhearsum, and the issue seems to be OOM. I've been monitoring it through an interactive task: memory grows steadily and reached 62% after several hours, so I guess it just reaches 100% after 9-13 hours. We already have a 256 GB worker, so bumping it further is not an option. `fast_align` is memory intensive, and I'm experimenting with switching to eflomal, which is supposed to be more memory efficient and accurate.
I'll continue testing things here and running experiments until it works. Then I'll create a bunch of new PRs for unrelated things based on @gregtatum's feedback and either modify this one for the main logic with @gregtatum's help or create a new one. Feel free to unsubscribe until then.