Evgeny Pavlov
If W&B doesn't allow this, the workaround would be to rename runs, adding the experiment name as a suffix so that they are unique. For example `student-finetuned_opustrainer`.
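A minimal sketch of the suffix workaround; `unique_run_name` is a hypothetical helper, and the actual renaming would happen via the W&B API or at run creation time:

```python
def unique_run_name(run_name: str, experiment: str) -> str:
    """Append the experiment name as a suffix so W&B display names are unique.

    Hypothetical helper illustrating the workaround described above.
    """
    return f"{run_name}_{experiment}"


# e.g. unique_run_name("student-finetuned", "opustrainer")
# -> "student-finetuned_opustrainer"
```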
We discussed that W&B uses display names to label charts, and since those names are not unique the runs don't show up properly on the charts. We discussed the following...
We don't have any issues with our current workflows. The supported Python versions are specified in poetry config and Docker images: https://github.com/mozilla/firefox-translations-training/blob/04e9e9cdc369cc8efdf080d57eef805a61d2c35e/pyproject.toml#L9
Those settings would also be visible in the experiment config and would simplify analysis of the experiments.
@marco-c FYI since you already started some work on this. I wanted to run an evaluation using our tools at some point.
> See also https://arxiv.org/pdf/2302.14520.pdf. This one is on my list :) ["A PARADIGM SHIFT IN MACHINE TRANSLATION: BOOSTING TRANSLATION PERFORMANCE OF LARGE LANGUAGE MODELS"](https://arxiv.org/pdf/2309.11674.pdf) is another interesting one
WMT23: https://aclanthology.org/2023.wmt-1.1.pdf
I analyzed the results of https://arxiv.org/pdf/2302.09210.pdf and https://arxiv.org/pdf/2309.11674.pdf and also benchmarked [ALMA-13B-LoRA](https://huggingface.co/haoranxu/ALMA-13B) myself. The quality is pretty good and looks on par with Google API for xx-en and slightly worse...
Sure, [here](https://docs.google.com/spreadsheets/d/1O77Ap0zA5xMw0gbzfLBDPLKaM2zAAVgFeN8TF227fbg/edit?usp=sharing) it is, but that's basically it. I also wanted to benchmark on WMT23 but didn't have time for it. For the back-of-the-envelope calculations: for [this mono task](https://firefox-ci-tc.services.mozilla.com/tasks/K1wuw2XQRPu4SfwJyWf5mQ/runs/0/logs/public/logs/live.log) we...
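The back-of-the-envelope style of estimate could look like this; the function and its inputs are illustrative assumptions, not measurements from the linked mono task:

```python
def gpu_hours(total_sentences: int, sentences_per_second: float) -> float:
    """Estimate GPU-hours needed to translate a monolingual corpus.

    Illustrative sketch: throughput is assumed constant per GPU, so the
    total GPU-hours is independent of how many GPUs run in parallel.
    """
    return total_sentences / sentences_per_second / 3600.0


# e.g. 360,000 sentences at 100 sentences/sec -> 1.0 GPU-hour
```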
The empty cells for some languages are where the model failed to follow the prompt and translate all the examples, so like with all other LLM tasks it's not 100%...
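A sketch of the kind of check that surfaces those empty cells, assuming outputs are aligned line-by-line with the sources; `missing_translations` is a hypothetical helper, not part of our pipeline:

```python
def missing_translations(sources: list[str], outputs: list[str]) -> list[int]:
    """Return indices of source sentences the model failed to translate.

    An output counts as missing if it is absent (the model stopped early)
    or blank. Sketch only; real prompt-compliance checks may be stricter.
    """
    missing = []
    for i in range(len(sources)):
        out = outputs[i] if i < len(outputs) else ""
        if not out.strip():
            missing.append(i)
    return missing
```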