firefox-translations-training Taskcluster publication

Refs #333

Mar 27 '24 15:03 La0

Got a training task definition with

taskcluster proxy enabled
tracking code install through pip
taskcluster secret env variable set

Mar 28 '24 10:03 La0

I compared the new train-teacher log form CI with the previous one. The wrapper should not modify the original log. It should stream stdout and stderr of the wrapped program as is and optionally add some logging records with the same convention. For example: [tracking] [INFO] Reading logs stream.

I wonder if this is related to my comment about stdout & stderr getting combined? I imagine that separating those (and handling both streams) might result in better behaviour. Even so - I'm not certain whether or not a downstream consumer can guarantee that they will process messages from two separate pipes in the same order they would've been printed to the console. (Maybe it is possible...I honestly don't know.)

Apr 15 '24 23:04 bhearsum

I compared the new train-teacher log form CI with the previous one. The wrapper should not modify the original log. It should stream stdout and stderr of the wrapped program as is and optionally add some logging records with the same convention. For example: [tracking] [INFO] Reading logs stream.

I wonder if this is related to my comment about stdout & stderr getting combined? I imagine that separating those (and handling both streams) might result in better behaviour. Even so - I'm not certain whether or not a downstream consumer can guarantee that they will process messages from two separate pipes in the same order they would've been printed to the console. (Maybe it is possible...I honestly don't know.)

We discussed live that it's ok to merge those two streams because they are intentionally used by OpusTrainer to output stuff. I guess it was used to separate Marian log and OpusTrainer's own log. Unless there's a way to easily redirect both separately using bash piping.

Apr 16 '24 16:04 eu9ene

Some other things we discussed should be implemented before merging:

we should investigate warnings with validation publishing. Indeed I don't see validation graphs on https://wandb.ai/moz-translations/moz-translations
we should add a parameter to the training config to disable publishing on CI and when we debug something and don't want to pollute the dashboards
as in the previous comment, preserving log format and adding [tracking] prefix for the parser
using the proper project/experiment name for publishing right away as we'll start using this when it's merged (should be ru-en/ci_<task_cluster_id> for the test CI run when enabled
(from the previous comment) making sure all the training and fine-tuning steps are tracked as we'll start using it for real experiments

Apr 16 '24 16:04 eu9ene

Some other things we discussed should be implemented before merging: * we should add a parameter to the training config to disable publishing on CI and when we debug something and don't want to pollute the dashboards

Apologies that I missed the meeting. One possible alternative here, if you want, might be to have a separate publishing location (some sort of sandbox) that we use for these cases.

Apr 16 '24 16:04 bhearsum

Some other things we discussed should be implemented before merging:

we should add a parameter to the training config to disable publishing on CI and when we debug something and don't want to pollute the dashboards

Apologies that I missed the meeting. One possible alternative here, if you want, might be to have a separate publishing location (some sort of sandbox) that we use for these cases.

It's a good idea. We could create a staging namespace for testing, I just don't know if we want to always publish there. The namespace can be called something like moz-translations-staging and then we can specify the namespace in the config. An empty value can mean that we don't want to publish anywhere. This namespace will be our sandbox for testing before rolling out to production. If everyone is ok with this approach I'll ask the MLOps team to create it for us. cc @gregtatum

I think in the short term we can implement it and put an empty string into the CI config. Then when the new namespace is created we'll be able to retest the full reuploading of the old experiments there instead of Teklia project.

Apr 16 '24 17:04 eu9ene

Some other things we discussed should be implemented before merging:

we should add a parameter to the training config to disable publishing on CI and when we debug something and don't want to pollute the dashboards

Apologies that I missed the meeting. One possible alternative here, if you want, might be to have a separate publishing location (some sort of sandbox) that we use for these cases.

It's a good idea. We could create a staging namespace for testing, I just don't know if we want to always publish there. The namespace can be called something like moz-translations-staging and then we can specify the namespace in the config. An empty value can mean that we don't want to publish anywhere. This namespace will be our sandbox for testing before rolling out to production. If everyone is ok with this approach I'll ask the MLOps team to create it for us. cc @gregtatum

I think in the short term we can implement it and put an empty string into the CI config. Then when the new namespace is created we'll be able to retest the full reuploading of the old experiments there instead of Teklia project.

UPD: I looked at the code and I see that what W&B namespace we write to is controlled by the secret token, so to implement this we would need to deal with secrets. Let's leave it out of scope for this PR and just make sure we can disable publishing from the experiment config.

Apr 16 '24 17:04 eu9ene

Publication is enabled for all training tasks now
CI is now :ok: again, I had to add a check to not trigger publication during unit tests
Weight & Biases project, group and run are now set using TC info & languages
I investigated why the validation metrics were not published and found out this is a regression due to different format for Marian 1.12 (which is currently in-use). The parser was initially targetting 1.10 and did not support the perplexity metric. @vrigal added support for that a while back, but this metric is not always present

For example in this task we see the [valid] lines that are parsed:

[task 2024-04-19T15:12:42.196Z] [2024-04-19 15:12:42] [valid] Ep. 1 : Up. 50 : chrf : 0.207837 : new best
[task 2024-04-19T15:12:45.034Z] [2024-04-19 15:12:45] [valid] Ep. 1 : Up. 50 : ce-mean-words : 7.22975 : new best
[task 2024-04-19T15:16:29.458Z] [2024-04-19 15:16:29] [valid] Ep. 1 : Up. 50 : bleu-detok : 0 : stalled 6 times (last best: 0)

No perplexity found, so this check fails.

@eu9ene Is this expected behaviour ? Can we set perplexity as optional ?

Apr 19 '24 18:04 La0

@eu9ene Is this expected behaviour ? Can we set perplexity as optional ?

@La0 All the metrics that are used for validation are set in the Marian config (--valid-metrics), so they are all kind of optional. We usually don't use perplexity. We set them here https://github.com/mozilla/firefox-translations-training/blob/145a84ace322184b5c5d6e9aca0ececb037c5d08/pipeline/train/train.sh#L40

Apr 19 '24 22:04 eu9ene

Last changes:

I made perplexity optional so that validation logs are correctly parsed
A training parameter wandb-publication has been added, disabled by default for CI tasks. You can control it when starting a new training

Apr 22 '24 15:04 La0

I launched a test training run from this branch https://firefox-ci-tc.services.mozilla.com/tasks/SngAEYS-RyWnY4GgKJTzSQ

Apr 24 '24 02:04 eu9ene

I launched a test training run from this branch https://firefox-ci-tc.services.mozilla.com/tasks/SngAEYS-RyWnY4GgKJTzSQ

I do not see any training task in the group, maybe the cache did not allow your tasks to run ?

Apr 24 '24 08:04 La0

I launched a test training run from this branch https://firefox-ci-tc.services.mozilla.com/tasks/SngAEYS-RyWnY4GgKJTzSQ

I do not see any training task in the group, maybe the cache did not allow your tasks to run ?

It's running https://firefox-ci-tc.services.mozilla.com/tasks/PDAfVzlWQMam4uqhndjafg. You're probably looking at the actions group which is different from the tasks group. Also, I ran it from a push task.

It looks working! https://wandb.ai/moz-translations/en-ru/groups/test_tracking_SngAEYS-RyWnY4GgKJTzSQ

Apr 24 '24 15:04 eu9ene

I rebased on latest main commits, and removed the extra commit that tweaked training parameters.

Apr 25 '24 07:04 La0

Ok, it looks good. If we find more issues or have more ideas on what to improve let's handle them in the follow-up tasks. We'll need to reupload the experiments anyway when it's feature complete. We need to make sure the publishing from CI is disabled though not to pollute the ru-en project.

Apr 29 '24 17:04 eu9ene

firefox-translations-training firefox-translations-training copied to clipboard

Taskcluster publication

firefox-translations-training
firefox-translations-training copied to clipboard