Greg Tatum

Results 373 comments of Greg Tatum

I wonder if we can load in dictionaries where it's literally one word to one word.

Or maybe even synthesize it with the alignment data.

This behavior is also visible with numbers. A good example is a list of numbers.

Verify the fix with: https://bugzilla.mozilla.org/show_bug.cgi?id=1888972

Here is a word count distribution for the merged corpus sl-en: https://firefox-ci-tc.services.mozilla.com/tasks/groups/PPCzZRHaTT6Ys4BIhPGT5w

![word count distribution "en"](https://github.com/user-attachments/assets/12c42a05-5fdc-4963-9b36-c887984edeae)
![word count distribution "sl"](https://github.com/user-attachments/assets/d0eb3c3b-daf3-4ed5-8e67-fd1052a06a9e)

Generated via:

```
python3 pipeline/data/analyze.py --file_location https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/VK5zmxJRTLy0y0WBQ0DRJg/artifacts/public/build/corpus.en.zst --output data --dataset...
```
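
For anyone who wants to reproduce a similar distribution locally, here is a rough sketch. It is not what `pipeline/data/analyze.py` does internally; it assumes one sentence per line and plain whitespace tokenization, and `word_count_distribution` is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable


def word_count_distribution(lines: Iterable[str]) -> Counter:
    """Count how many sentences have each whitespace-delimited word count."""
    distribution = Counter()
    for line in lines:
        # Assumption: one sentence per line, words separated by whitespace.
        distribution[len(line.split())] += 1
    return distribution


if __name__ == "__main__":
    sample = ["Hello", "Hello world", "1 2 3", "Good morning"]
    for words, sentences in sorted(word_count_distribution(sample).items()):
        print(f"{words} word(s): {sentences} sentence(s)")
```

A corpus dominated by long sentences would show very small counts in the 1 to 3 word buckets, which is the kind of gap the histograms above are meant to reveal.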

I filed #878 which suggests augmenting with statistically synthesized single word translations.
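
One way this kind of synthesis might work, sketched under assumptions: the word alignments are in the Pharaoh `i-j` format (source index, then target index) and the sentence pairs are already tokenized on whitespace. The function name `synthesize_word_pairs` and the `min_count` threshold are made up for illustration and are not taken from #878.

```python
from collections import Counter, defaultdict


def synthesize_word_pairs(sources, targets, alignments, min_count=2):
    """Pick the most frequently aligned target word for each source word."""
    counts = defaultdict(Counter)
    for src, trg, align in zip(sources, targets, alignments):
        src_words, trg_words = src.split(), trg.split()
        for pair in align.split():
            # Assumption: Pharaoh format, e.g. "0-0 1-2".
            i, j = (int(index) for index in pair.split("-"))
            counts[src_words[i]][trg_words[j]] += 1

    dictionary = {}
    for src_word, trg_counts in counts.items():
        trg_word, count = trg_counts.most_common(1)[0]
        # Drop rare pairings, which are more likely to be alignment noise.
        if count >= min_count:
            dictionary[src_word] = trg_word
    return dictionary
```

The resulting one-word pairs could then be appended to the training data as extra single-word examples.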

I filed #879 which suggests harvesting short sentences from parallel datasets.
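
A minimal sketch of the harvesting idea, assuming the parallel data is available as `(source, target)` string pairs and that "short" means at most a few whitespace-separated words on both sides. `harvest_short_pairs` and `max_words` are hypothetical names, not part of the pipeline.

```python
def harvest_short_pairs(pairs, max_words=3):
    """Yield (source, target) pairs where both sides have at most max_words words."""
    for source, target in pairs:
        source, target = source.strip(), target.strip()
        # Skip pairs where either side is empty after trimming.
        if not source or not target:
            continue
        if len(source.split()) <= max_words and len(target.split()) <= max_words:
            yield source, target


if __name__ == "__main__":
    corpus = [
        ("Good morning", "Dobro jutro"),
        ("This sentence is far too long to count as short", "..."),
    ]
    print(list(harvest_short_pairs(corpus)))
```

Requiring both sides to be short avoids keeping pairs where a short source is matched with a long, likely misaligned target.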

I filed #880 which suggests statistically synthesizing short sentence translations from monolingual data sources.

Some of the Taskcluster-specific ones I'm looking at migrating to Taskcluster, so it definitely won't be all of them. And some were just me messing around to learn things.

The work here is to also figure out how to do the initial deploy for it and have the docs co-exist. I'm not sure how that will work yet.