Greg Tatum issues

Results: 204 issues of Greg Tatum

Consolidate all of the training scripts into a main pipeline/train/train.py script

We have:

```
pipeline/train/train.sh
taskcluster/scripts/pipeline/train-taskcluster.sh
taskcluster/scripts/pipeline/train_taskcluster.py
```

It will be much simpler to put this into a single training script. We should probably do this after our first big training...

refactoring

Our models should be robust enough to translate a calendar

The models tend to break down when presented with single numbers, or lists of numbers. For example de-en translates ``` 1 2 3 4 5 6 7 8 9 ```...

quality

Migrate all the run_task tests into a separate folder

They have a rather unique behavior, and so it would be helpful to have them separated out. Something like:

* `tests/task`
* `tests/unit`

refactoring

Build alphabet support from CLDR data

In PR #157 I added additional alphabet support. This information is provided by professional translators in the CLDR data: https://github.com/unicode-org/cldr-json/blob/0876ec40e13d54c0a6b6456392802d4de7e059cb/cldr-json/cldr-misc-full/main/sl/characters.json It would be nice to consume that JSON and automate...
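A minimal sketch of what consuming that JSON might look like. It assumes the `main` → locale → `characters` → `exemplarCharacters` layout of the linked file; the `SAMPLE` data is made up for illustration, and `parse_exemplars` and `alphabet` are hypothetical helpers, not existing pipeline code. It only handles space-separated entries and `{...}` multi-character sequences, not ranges or escapes.

```python
import json

# Made-up sample that mimics the assumed layout of a CLDR characters.json file.
SAMPLE = json.loads(
    '{"main": {"sl": {"characters": '
    '{"exemplarCharacters": "[a b c č d {dž} e]", "auxiliary": "[á à q]"}}}}'
)


def parse_exemplars(unicode_set):
    """Split a simple exemplar set such as "[a b {ch}]" into a list of entries."""
    body = unicode_set.strip()
    if body.startswith("[") and body.endswith("]"):
        body = body[1:-1]
    entries = []
    for token in body.split():
        # Braces group a multi-character sequence that counts as one entry.
        if token.startswith("{") and token.endswith("}"):
            token = token[1:-1]
        entries.append(token)
    return entries


def alphabet(cldr, locale):
    """Return the main exemplar characters for a locale (hypothetical helper)."""
    characters = cldr["main"][locale]["characters"]
    return parse_exemplars(characters["exemplarCharacters"])
```

Calling `alphabet(SAMPLE, "sl")` returns the entries in their original order, with `{dž}` unwrapped to `dž`.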

Add a cleaning rule for URL names, such as Amazon.com -> Amazon.it

Similar to #736, we should discard sentences that translate the domain suffix of a website, like Amazon.com -> Amazon.it. With a regex such as `/[a-z]+\.com\b/` we could identify a URL...
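One way such a rule could be sketched, assuming a sentence pair is available as two strings. `DOMAIN_RE`, `domains`, and `has_translated_domain` are hypothetical names, and the pattern is deliberately loose (it accepts any alphabetic suffix, not just `.com`), so it would need tuning against real data before use.

```python
import re

# Hypothetical pattern: a site name, a dot, then an alphabetic suffix.
DOMAIN_RE = re.compile(r"\b([a-z0-9-]+)\.([a-z]{2,})\b", re.IGNORECASE)


def domains(text):
    """Return the set of lowercased (name, suffix) pairs found in text."""
    return {
        (match.group(1).lower(), match.group(2).lower())
        for match in DOMAIN_RE.finditer(text)
    }


def has_translated_domain(source, target):
    """True when a site name appears on both sides with a different suffix."""
    suffixes_by_name = {}
    for name, suffix in domains(source):
        suffixes_by_name.setdefault(name, set()).add(suffix)
    for name, suffix in domains(target):
        if name in suffixes_by_name and suffix not in suffixes_by_name[name]:
            return True
    return False
```

A pair such as "Buy it on Amazon.com today." / "Compralo su Amazon.it oggi." would be flagged, while a pair that keeps Amazon.com on both sides would pass.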

Add support for the Quarters field "Q"

As a follow-up to #481, we need support for the "Q" field, which is Quarters.

T-enhancement

help wanted

A-scope

C-datetime

S-small

Limit the amount of data used for distillation

In #771 I ran an experiment to see how the size of the distillation corpus affects the COMET score of the students. Adding more data...

cost & perf

Investigate removing teacher ensemble training

Training a second teacher improves performance only slightly. It may be more cost efficient to take the quality hit and remove it.

Comet Change | Average Type
-- | --...

cost & perf

experiment

Reduce monolingual data for da-en to investigate distillation performance

An experiment for #231. da-en is one of our best models from the spring-2024 run. The teacher ensemble had a COMET score of 0.9013. The student COMET was 0.8950, with...

experiment

Run dhat or similar memory tools on a natively built version of the browsermt marian-dev fork

In Firefox the memory size of the inference engine is quite large in wasm. There aren't good memory tools to analyze the wasm. Instead, we should compile it natively, and...

inference