firefox-translations-training icon indicating copy to clipboard operation
firefox-translations-training copied to clipboard

Taskcluster publication

Open La0 opened this issue 10 months ago • 1 comments

Refs #333

La0 avatar Mar 27 '24 15:03 La0

Got a training task definition with

  • taskcluster proxy enabled
  • tracking code install through pip
  • taskcluster secret env variable set

La0 avatar Mar 28 '24 10:03 La0

I compared the new train-teacher log form CI with the previous one. The wrapper should not modify the original log. It should stream stdout and stderr of the wrapped program as is and optionally add some logging records with the same convention. For example: [tracking] [INFO] Reading logs stream.

I wonder if this is related to my comment about stdout & stderr getting combined? I imagine that separating those (and handling both streams) might result in better behaviour. Even so - I'm not certain whether or not a downstream consumer can guarantee that they will process messages from two separate pipes in the same order they would've been printed to the console. (Maybe it is possible...I honestly don't know.)

bhearsum avatar Apr 15 '24 23:04 bhearsum

I compared the new train-teacher log form CI with the previous one. The wrapper should not modify the original log. It should stream stdout and stderr of the wrapped program as is and optionally add some logging records with the same convention. For example: [tracking] [INFO] Reading logs stream.

I wonder if this is related to my comment about stdout & stderr getting combined? I imagine that separating those (and handling both streams) might result in better behaviour. Even so - I'm not certain whether or not a downstream consumer can guarantee that they will process messages from two separate pipes in the same order they would've been printed to the console. (Maybe it is possible...I honestly don't know.)

We discussed live that it's ok to merge those two streams because they are intentionally used by OpusTrainer to output stuff. I guess it was used to separate Marian log and OpusTrainer's own log. Unless there's a way to easily redirect both separately using bash piping.

eu9ene avatar Apr 16 '24 16:04 eu9ene

Some other things we discussed should be implemented before merging:

  • we should investigate warnings with validation publishing. Indeed I don't see validation graphs on https://wandb.ai/moz-translations/moz-translations
  • we should add a parameter to the training config to disable publishing on CI and when we debug something and don't want to pollute the dashboards
  • as in the previous comment, preserving log format and adding [tracking] prefix for the parser
  • using the proper project/experiment name for publishing right away as we'll start using this when it's merged (should be ru-en/ci_<task_cluster_id> for the test CI run when enabled
  • (from the previous comment) making sure all the training and fine-tuning steps are tracked as we'll start using it for real experiments

eu9ene avatar Apr 16 '24 16:04 eu9ene

Some other things we discussed should be implemented before merging: * we should add a parameter to the training config to disable publishing on CI and when we debug something and don't want to pollute the dashboards

Apologies that I missed the meeting. One possible alternative here, if you want, might be to have a separate publishing location (some sort of sandbox) that we use for these cases.

bhearsum avatar Apr 16 '24 16:04 bhearsum

Some other things we discussed should be implemented before merging:

  • we should add a parameter to the training config to disable publishing on CI and when we debug something and don't want to pollute the dashboards

Apologies that I missed the meeting. One possible alternative here, if you want, might be to have a separate publishing location (some sort of sandbox) that we use for these cases.

It's a good idea. We could create a staging namespace for testing, I just don't know if we want to always publish there. The namespace can be called something like moz-translations-staging and then we can specify the namespace in the config. An empty value can mean that we don't want to publish anywhere. This namespace will be our sandbox for testing before rolling out to production. If everyone is ok with this approach I'll ask the MLOps team to create it for us. cc @gregtatum

I think in the short term we can implement it and put an empty string into the CI config. Then when the new namespace is created we'll be able to retest the full reuploading of the old experiments there instead of Teklia project.

eu9ene avatar Apr 16 '24 17:04 eu9ene

Some other things we discussed should be implemented before merging:

  • we should add a parameter to the training config to disable publishing on CI and when we debug something and don't want to pollute the dashboards

Apologies that I missed the meeting. One possible alternative here, if you want, might be to have a separate publishing location (some sort of sandbox) that we use for these cases.

It's a good idea. We could create a staging namespace for testing, I just don't know if we want to always publish there. The namespace can be called something like moz-translations-staging and then we can specify the namespace in the config. An empty value can mean that we don't want to publish anywhere. This namespace will be our sandbox for testing before rolling out to production. If everyone is ok with this approach I'll ask the MLOps team to create it for us. cc @gregtatum

I think in the short term we can implement it and put an empty string into the CI config. Then when the new namespace is created we'll be able to retest the full reuploading of the old experiments there instead of Teklia project.

UPD: I looked at the code and I see that what W&B namespace we write to is controlled by the secret token, so to implement this we would need to deal with secrets. Let's leave it out of scope for this PR and just make sure we can disable publishing from the experiment config.

eu9ene avatar Apr 16 '24 17:04 eu9ene

  • Publication is enabled for all training tasks now
  • CI is now :ok: again, I had to add a check to not trigger publication during unit tests
  • Weight & Biases project, group and run are now set using TC info & languages
  • I investigated why the validation metrics were not published and found out this is a regression due to different format for Marian 1.12 (which is currently in-use). The parser was initially targetting 1.10 and did not support the perplexity metric. @vrigal added support for that a while back, but this metric is not always present

For example in this task we see the [valid] lines that are parsed:

[task 2024-04-19T15:12:42.196Z] [2024-04-19 15:12:42] [valid] Ep. 1 : Up. 50 : chrf : 0.207837 : new best
[task 2024-04-19T15:12:45.034Z] [2024-04-19 15:12:45] [valid] Ep. 1 : Up. 50 : ce-mean-words : 7.22975 : new best
[task 2024-04-19T15:16:29.458Z] [2024-04-19 15:16:29] [valid] Ep. 1 : Up. 50 : bleu-detok : 0 : stalled 6 times (last best: 0)

No perplexity found, so this check fails.

@eu9ene Is this expected behaviour ? Can we set perplexity as optional ?

La0 avatar Apr 19 '24 18:04 La0

@eu9ene Is this expected behaviour ? Can we set perplexity as optional ?

@La0 All the metrics that are used for validation are set in the Marian config (--valid-metrics), so they are all kind of optional. We usually don't use perplexity. We set them here https://github.com/mozilla/firefox-translations-training/blob/145a84ace322184b5c5d6e9aca0ececb037c5d08/pipeline/train/train.sh#L40

eu9ene avatar Apr 19 '24 22:04 eu9ene

Last changes:

  • I made perplexity optional so that validation logs are correctly parsed
  • A training parameter wandb-publication has been added, disabled by default for CI tasks. You can control it when starting a new training

La0 avatar Apr 22 '24 15:04 La0

I launched a test training run from this branch https://firefox-ci-tc.services.mozilla.com/tasks/SngAEYS-RyWnY4GgKJTzSQ

eu9ene avatar Apr 24 '24 02:04 eu9ene

I launched a test training run from this branch https://firefox-ci-tc.services.mozilla.com/tasks/SngAEYS-RyWnY4GgKJTzSQ

I do not see any training task in the group, maybe the cache did not allow your tasks to run ?

La0 avatar Apr 24 '24 08:04 La0

I launched a test training run from this branch https://firefox-ci-tc.services.mozilla.com/tasks/SngAEYS-RyWnY4GgKJTzSQ

I do not see any training task in the group, maybe the cache did not allow your tasks to run ?

It's running https://firefox-ci-tc.services.mozilla.com/tasks/PDAfVzlWQMam4uqhndjafg. You're probably looking at the actions group which is different from the tasks group. Also, I ran it from a push task.

It looks working! https://wandb.ai/moz-translations/en-ru/groups/test_tracking_SngAEYS-RyWnY4GgKJTzSQ

eu9ene avatar Apr 24 '24 15:04 eu9ene

I rebased on latest main commits, and removed the extra commit that tweaked training parameters.

La0 avatar Apr 25 '24 07:04 La0

Ok, it looks good. If we find more issues or have more ideas on what to improve let's handle them in the follow-up tasks. We'll need to reupload the experiments anyway when it's feature complete. We need to make sure the publishing from CI is disabled though not to pollute the ru-en project.

eu9ene avatar Apr 29 '24 17:04 eu9ene