firefox-translations-training
Taskcluster publication
Refs #333
Got a training task definition with:
- Taskcluster proxy enabled
- tracking code installed through pip
- Taskcluster secret env variable set
I compared the new train-teacher log from CI with the previous one. The wrapper should not modify the original log: it should stream the wrapped program's stdout and stderr as-is, and optionally add its own logging records following the same convention, for example: `[tracking] [INFO] Reading logs stream.`
I wonder if this is related to my comment about stdout and stderr getting combined? I imagine that separating those (and handling both streams) might result in better behaviour. Even so, I'm not certain a downstream consumer can guarantee it will process messages from two separate pipes in the same order they would have been printed to the console. (Maybe it is possible; I honestly don't know.)
> I compared the new train-teacher log from CI with the previous one. The wrapper should not modify the original log: it should stream the wrapped program's stdout and stderr as-is, and optionally add its own logging records following the same convention, for example: `[tracking] [INFO] Reading logs stream.`
>
> I wonder if this is related to my comment about stdout and stderr getting combined? I imagine that separating those (and handling both streams) might result in better behaviour. Even so, I'm not certain a downstream consumer can guarantee it will process messages from two separate pipes in the same order they would have been printed to the console.
We discussed live that it's OK to merge those two streams because OpusTrainer intentionally uses both to output stuff. I guess stderr was used to separate the Marian log from OpusTrainer's own log. Unless there's a way to easily redirect both separately using bash piping.
Some other things we discussed should be implemented before merging:
- we should investigate warnings with validation publishing. Indeed I don't see validation graphs on https://wandb.ai/moz-translations/moz-translations
- we should add a parameter to the training config to disable publishing on CI and when we debug something and don't want to pollute the dashboards
- as in the previous comment, preserving the log format and adding the `[tracking]` prefix for the parser
- using the proper project/experiment name for publishing right away, as we'll start using this when it's merged (should be `ru-en/ci_<task_cluster_id>` for the test CI run when enabled)
- (from the previous comment) making sure all the training and fine-tuning steps are tracked, as we'll start using it for real experiments
> Some other things we discussed should be implemented before merging:
>
> - we should add a parameter to the training config to disable publishing on CI and when we debug something and don't want to pollute the dashboards
Apologies that I missed the meeting. One possible alternative here, if you want, might be to have a separate publishing location (some sort of sandbox) that we use for these cases.
> Some other things we discussed should be implemented before merging:
>
> - we should add a parameter to the training config to disable publishing on CI and when we debug something and don't want to pollute the dashboards
>
> Apologies that I missed the meeting. One possible alternative here, if you want, might be to have a separate publishing location (some sort of sandbox) that we use for these cases.
It's a good idea. We could create a staging namespace for testing; I just don't know if we want to always publish there. The namespace could be called something like `moz-translations-staging`, and then we could specify the namespace in the config. An empty value would mean that we don't want to publish anywhere. This namespace will be our sandbox for testing before rolling out to production. If everyone is OK with this approach I'll ask the MLOps team to create it for us. cc @gregtatum
I think in the short term we can implement it and put an empty string into the CI config. Then when the new namespace is created we'll be able to retest the full re-uploading of the old experiments there instead of the Teklia project.
> Some other things we discussed should be implemented before merging:
>
> - we should add a parameter to the training config to disable publishing on CI and when we debug something and don't want to pollute the dashboards
>
> Apologies that I missed the meeting. One possible alternative here, if you want, might be to have a separate publishing location (some sort of sandbox) that we use for these cases.
>
> It's a good idea. We could create a staging namespace for testing; I just don't know if we want to always publish there. The namespace could be called something like `moz-translations-staging`, and then we could specify the namespace in the config. An empty value would mean that we don't want to publish anywhere. This namespace will be our sandbox for testing before rolling out to production. If everyone is OK with this approach I'll ask the MLOps team to create it for us. cc @gregtatum
>
> I think in the short term we can implement it and put an empty string into the CI config. Then when the new namespace is created we'll be able to retest the full re-uploading of the old experiments there instead of the Teklia project.
UPD: I looked at the code and I see that which W&B namespace we write to is controlled by the secret token, so to implement this we would need to deal with secrets. Let's leave it out of scope for this PR and just make sure we can disable publishing from the experiment config.
- Publication is enabled for all training tasks now
- CI is now :ok: again; I had to add a check to not trigger publication during unit tests
- The Weights & Biases project, group and run are now set using Taskcluster info and languages
- I investigated why the validation metrics were not published and found out this is a regression due to a different log format in Marian 1.12 (which is currently in use). The parser was initially targeting 1.10 and did not support the `perplexity` metric. @vrigal added support for that a while back, but this metric is not always present.

For example, in this task we see the `[valid]` lines that are parsed:

```
[task 2024-04-19T15:12:42.196Z] [2024-04-19 15:12:42] [valid] Ep. 1 : Up. 50 : chrf : 0.207837 : new best
[task 2024-04-19T15:12:45.034Z] [2024-04-19 15:12:45] [valid] Ep. 1 : Up. 50 : ce-mean-words : 7.22975 : new best
[task 2024-04-19T15:16:29.458Z] [2024-04-19 15:16:29] [valid] Ep. 1 : Up. 50 : bleu-detok : 0 : stalled 6 times (last best: 0)
```

No `perplexity` is found, so this check fails.
@eu9ene Is this expected behaviour? Can we set `perplexity` as optional?
> @eu9ene Is this expected behaviour? Can we set `perplexity` as optional?
@La0 All the metrics that are used for validation are set in the Marian config (`--valid-metrics`), so they are all kind of optional. We usually don't use perplexity. We set them here: https://github.com/mozilla/firefox-translations-training/blob/145a84ace322184b5c5d6e9aca0ececb037c5d08/pipeline/train/train.sh#L40
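Since every metric in `--valid-metrics` is effectively optional, a tolerant parser could collect whatever metrics appear for each validation point instead of requiring `perplexity`. A hypothetical sketch (function and regex names are illustrative, not the actual parser code), using the `[valid]` line format shown above:

```python
import re

# Matches lines like "[valid] Ep. 1 : Up. 50 : chrf : 0.207837 : new best".
VALID_RE = re.compile(
    r"\[valid\]\s+Ep\.\s+(?P<ep>\d+)\s+:\s+Up\.\s+(?P<up>\d+)"
    r"\s+:\s+(?P<metric>[\w-]+)\s+:\s+(?P<value>[\d.eE+-]+)"
)


def parse_valid_line(line):
    """Return one parsed validation record, or None for non-matching lines."""
    m = VALID_RE.search(line)
    if not m:
        return None
    return {
        "epoch": int(m.group("ep")),
        "update": int(m.group("up")),
        "metric": m.group("metric"),
        "value": float(m.group("value")),
    }


def collect_metrics(lines):
    """Group metric values by (epoch, update); no specific metric is mandatory."""
    metrics = {}
    for line in lines:
        parsed = parse_valid_line(line)
        if parsed:
            key = (parsed["epoch"], parsed["update"])
            metrics.setdefault(key, {})[parsed["metric"]] = parsed["value"]
    return metrics
```

With this shape, a run validated on `chrf`, `ce-mean-words` and `bleu-detok` publishes those three series and simply omits `perplexity` rather than failing.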
Last changes:
- I made `perplexity` optional so that validation logs are correctly parsed
- A training parameter `wandb-publication` has been added, disabled by default for CI tasks. You can control it when starting a new training
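A hypothetical sketch of how the `wandb-publication` flag and the Taskcluster-based naming could fit together. The `wandb-publication` key, the `en-ru` project, and the `<experiment>_<task_id>` group convention come from this thread; the helper name, the other config keys, and the use of the `TASK_ID` environment variable are illustrative assumptions, not the actual implementation:

```python
import os


def wandb_run_info(config):
    """Decide whether and where to publish, based on the training config.

    Returns None when publication is disabled (the default for CI tasks);
    otherwise returns the W&B project/group/run names built from the
    language pair and Taskcluster info, e.g. to pass to wandb.init().
    """
    if not config.get("wandb-publication", False):
        return None  # disabled: don't pollute the dashboards
    src = config["src_locale"]
    trg = config["trg_locale"]
    task_id = os.environ.get("TASK_ID", "local")  # set by Taskcluster workers
    return {
        "project": f"{src}-{trg}",
        "group": f"{config.get('experiment', 'test')}_{task_id}",
        "run": config.get("training_step", "train-teacher"),
    }
```

Keeping the decision in one place means CI only has to set `wandb-publication: false` (or leave it unset) in its config, with no secret handling involved.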
I launched a test training run from this branch https://firefox-ci-tc.services.mozilla.com/tasks/SngAEYS-RyWnY4GgKJTzSQ
> I launched a test training run from this branch https://firefox-ci-tc.services.mozilla.com/tasks/SngAEYS-RyWnY4GgKJTzSQ
I do not see any training task in the group; maybe the cache did not allow your tasks to run?
> I launched a test training run from this branch https://firefox-ci-tc.services.mozilla.com/tasks/SngAEYS-RyWnY4GgKJTzSQ
>
> I do not see any training task in the group; maybe the cache did not allow your tasks to run?
It's running: https://firefox-ci-tc.services.mozilla.com/tasks/PDAfVzlWQMam4uqhndjafg. You're probably looking at the actions group, which is different from the tasks group. Also, I ran it from a push task.
It looks like it's working! https://wandb.ai/moz-translations/en-ru/groups/test_tracking_SngAEYS-RyWnY4GgKJTzSQ
I rebased on the latest `main` commits and removed the extra commit that tweaked training parameters.
OK, it looks good. If we find more issues or have more ideas on what to improve, let's handle them in follow-up tasks. We'll need to re-upload the experiments anyway when it's feature complete. We need to make sure publishing from CI is disabled, though, so as not to pollute the ru-en project.