[Experiment] Inline noise Feb 2024

Open eu9ene opened this issue 1 year ago • 6 comments

  • produce space-tokenized alignments for the teacher and student corpora so they work with the Tags (inline noise) modifier
  • jointly train alignments for the original and back-translated corpora to improve accuracy
  • move shortlist production into a separate step
  • add inline noise augmentation to the dataset importer, using the unsupervised, BERT-based SimAlign library to get alignments there (fast_align won't work well on a small corpus)
  • adjust augmentation settings
  • add tests for alignment tasks using precompiled fast_align and extract_lex from toolchain
  • add a script to download toolchain binaries locally
  • add the ability to run tests locally under Docker
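
For readers unfamiliar with why the alignments need to be space-tokenized: fast_align emits word alignments as `i-j` index pairs over whitespace-split tokens, so a modifier can only place matching inline noise on both sides if the indices refer to the same space-separated words it sees. The sketch below illustrates the idea under stated assumptions. The function names, the `<x>` noise token, and the "insert after one aligned pair" strategy are hypothetical; this is a simplified stand-in, not the actual Tags modifier implementation.

```python
import random


def parse_alignment(line):
    """Parse a line of 'i-j' pairs (the format fast_align prints) into index tuples."""
    pairs = []
    for pair in line.split():
        src, trg = pair.split("-")
        pairs.append((int(src), int(trg)))
    return pairs


def add_inline_noise(source, target, alignment, noise="<x>", rng=None):
    """Insert the same noise token after one aligned word pair on both sides.

    Hypothetical illustration: indices are positions in the space-split
    sentences, which is why the alignments must be space-tokenized.
    """
    rng = rng or random.Random()
    src_tokens = source.split(" ")
    trg_tokens = target.split(" ")
    pairs = parse_alignment(alignment)
    if not pairs:
        # Nothing is aligned, so leave the sentence pair untouched.
        return source, target
    src_idx, trg_idx = rng.choice(pairs)
    src_tokens.insert(src_idx + 1, noise)
    trg_tokens.insert(trg_idx + 1, noise)
    return " ".join(src_tokens), " ".join(trg_tokens)


if __name__ == "__main__":
    src, trg = add_inline_noise(
        "the cat sleeps", "die Katze schläft", "0-0 1-1 2-2", rng=random.Random(0)
    )
    print(src)
    print(trg)
```

If the alignments were computed over subword tokens instead, the indices would no longer point at the words the modifier splits on, and the noise would land in mismatched positions.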

See the paper: SimAlign: High Quality Word Alignments without Parallel Training Data using Static and Contextualized Embeddings

fixes #331 fixes #456

[skip ci]

eu9ene avatar Feb 12 '24 23:02 eu9ene

It's ready for review, but I just launched the experiment that will take a while to complete: https://firefox-ci-tc.services.mozilla.com/tasks/groups/Uv6EgA9SQdGT54nkWFwACQ

eu9ene avatar Feb 28 '24 19:02 eu9ene

I guess since I left a comment I'm not on the review anymore. I'll re-add myself and try to look at it tomorrow.

gregtatum avatar Feb 28 '24 22:02 gregtatum

https://firefox-ci-tc.services.mozilla.com/tasks/Jmrh6CCOS_23BG-dNKUC9A

I see this ultimately failed because of what appears to be a bunch of spot terminations killing the alignments-teacher task. Do you want to switch this to run on a -standard instance?

bhearsum avatar Mar 04 '24 16:03 bhearsum

@bhearsum It's weird that we got 5 terminations in a row. I don't think that was the case before. Yes, we can switch but we should investigate what changed.

eu9ene avatar Mar 04 '24 18:03 eu9ene

Running with a 1 TB disk: https://firefox-ci-tc.services.mozilla.com/tasks/groups/DE-5HDfTSdKUsHXUUeUiOA

eu9ene avatar Mar 05 '24 00:03 eu9ene

@bhearsum and I investigated, and it seems the issue is OOM. I've been monitoring the task through an interactive task: memory usage grows steadily and reached 62% after several hours, so my guess is that it hits 100% after 9-13 hours. We already have a 256 GB worker, so bumping it further is not an option. fast_align is memory intensive, so I'm experimenting with switching to eflomal, which is supposed to be more memory efficient and more accurate.
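
The "steady growth, so it fills up after N hours" reasoning above is a linear extrapolation. A minimal sketch of that estimate follows, assuming memory usage is sampled as (hours elapsed, percent used) pairs. The function name and the sample values in the usage example are hypothetical and are not measurements from this task.

```python
def hours_until_full(samples, limit_pct=100.0):
    """Estimate when memory reaches limit_pct using a least-squares linear fit.

    samples: list of (hours_elapsed, memory_pct) tuples.
    Returns the estimated elapsed hours at which the limit is hit,
    or None if usage is not growing or there is too little data.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (m - mean_m) for t, m in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_m - slope * mean_t
    return (limit_pct - intercept) / slope


if __name__ == "__main__":
    # Hypothetical readings for illustration only.
    readings = [(0, 10.0), (1, 20.0), (2, 30.0)]
    print(hours_until_full(readings))
```

A linear fit is only a rough guide: if the aligner's memory use plateaus or accelerates, the real time to OOM will differ.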

I'll continue testing things here and running experiments until it works. Then I'll create a bunch of new PRs for unrelated things based on @gregtatum's feedback and either modify this one for the main logic with @gregtatum's help or create a new one. Feel free to unsubscribe until then.

eu9ene avatar Mar 08 '24 01:03 eu9ene