Open-Assistant

Automatic data translator

Open ParisNeo opened this issue 2 years ago • 15 comments

Context

The objective of the auto_translate module is to increase the amount of data by enriching each language with translations of data written in other languages.

It is added as a separate folder at the repository root because it belongs to neither the frontend nor the backend. It is written in Python.

MBART translator

mbart_translator.py contains a class called MBartTranslator that provides multilingual translation. It uses the mBART-50 many-to-many model from Facebook to translate text from one language to another.

I have already built the MBartTranslator class in auto_translate/mbart_translator.py. I have also added example code that shows how it translates from one non-English language to another, or from English to other languages.

The code supports 50 languages : Arabic (ar_AR), Czech (cs_CZ), German (de_DE), English (en_XX), Spanish (es_XX), Estonian (et_EE), Finnish (fi_FI), French (fr_XX), Gujarati (gu_IN), Hindi (hi_IN), Italian (it_IT), Japanese (ja_XX), Kazakh (kk_KZ), Korean (ko_KR), Lithuanian (lt_LT), Latvian (lv_LV), Burmese (my_MM), Nepali (ne_NP), Dutch (nl_XX), Romanian (ro_RO), Russian (ru_RU), Sinhala (si_LK), Turkish (tr_TR), Vietnamese (vi_VN), Chinese (zh_CN), Afrikaans (af_ZA), Azerbaijani (az_AZ), Bengali (bn_IN), Persian (fa_IR), Hebrew (he_IL), Croatian (hr_HR), Indonesian (id_ID), Georgian (ka_GE), Khmer (km_KH), Macedonian (mk_MK), Malayalam (ml_IN), Mongolian (mn_MN), Marathi (mr_IN), Polish (pl_PL), Pashto (ps_AF), Portuguese (pt_XX), Swedish (sv_SE), Swahili (sw_KE), Tamil (ta_IN), Telugu (te_IN), Thai (th_TH), Tagalog (tl_XX), Ukrainian (uk_UA), Urdu (ur_PK), Xhosa (xh_ZA), Galician (gl_ES), Slovene (sl_SI)

The model's performance is not perfect. It tends to work better when translating from or to English than between two other languages.

We can also use other models available on Hugging Face by constructing MBartTranslator with the model path (the model is downloaded automatically).
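For illustration, here is a minimal sketch of what such a wrapper could look like with the Hugging Face transformers library. It is not the actual content of auto_translate/mbart_translator.py: the checkpoint name, the `translate` method, the abbreviated `SUPPORTED` set, and the lazy loading are all assumptions made for the example.

```python
class MBartTranslator:
    """Illustrative wrapper around an mBART-50 many-to-many checkpoint.

    Hypothetical sketch, not the code from the PR.
    """

    # Abbreviated for the example; the full list of codes is given above.
    SUPPORTED = {"ar_AR", "de_DE", "en_XX", "es_XX", "fr_XX", "ja_XX", "zh_CN"}

    def __init__(self, model_name="facebook/mbart-large-50-many-to-many-mmt"):
        # Assumed default checkpoint; any compatible model path can be passed.
        self.model_name = model_name
        self._model = None
        self._tokenizer = None

    def _load(self):
        # Imported lazily so the object can be created without downloading weights.
        from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

        self._tokenizer = MBart50TokenizerFast.from_pretrained(self.model_name)
        self._model = MBartForConditionalGeneration.from_pretrained(self.model_name)

    def translate(self, text, src_lang, tgt_lang):
        # Validate the language codes before touching the model.
        for code in (src_lang, tgt_lang):
            if code not in self.SUPPORTED:
                raise ValueError(f"Unsupported language code: {code}")
        if self._model is None:
            self._load()

        # mBART-50 needs the source language set on the tokenizer and the
        # target language token forced as the first generated token.
        self._tokenizer.src_lang = src_lang
        encoded = self._tokenizer(text, return_tensors="pt")
        generated = self._model.generate(
            **encoded,
            forced_bos_token_id=self._tokenizer.convert_tokens_to_ids(tgt_lang),
        )
        return self._tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```

With this sketch, a call such as `MBartTranslator().translate("Bonjour", "fr_XX", "en_XX")` would download the checkpoint on first use, and passing a different model path to the constructor would swap the model.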

TODO

There are still steps left to make this module work.

Now, the interesting part is what comes next. I am not yet familiar with the database structure, so I am asking if someone can do this in my place:

We create a special user called translator. This user is used when creating new prompts or answers through translation, and it should be excluded from the leaderboard.

The idea is that, before translating a text, the translator user verifies that the text was not written by itself. Otherwise we can fall into infinite loops.

So it just scans the database for non-translated messages, translates each one into all other languages using the code I've added, then posts each translation with itself as the author.

This should be executed periodically, for example every night. Because translations are posted under the translator user, we can tell what was written by real users and what was written by the translator. Translations can be critiqued by people using the usual rating tools.
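The plan above can be sketched in a few lines. This is only an in-memory illustration, since the real database schema is not described here: the `Message` fields, the `translated` flag, the `TRANSLATOR_USER` name, and the `translate` callable are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

TRANSLATOR_USER = "translator"  # hypothetical username of the special account


@dataclass
class Message:
    text: str
    lang: str
    author: str
    translated: bool = False  # hypothetical flag: already processed by a previous run


def translation_pass(
    messages: List[Message],
    languages: List[str],
    translate: Callable[[str, str, str], str],
) -> List[Message]:
    """Return new translator-authored messages for every untranslated human message."""
    new_messages = []
    for msg in messages:
        # Skip the translator's own output to avoid translating translations forever.
        if msg.author == TRANSLATOR_USER or msg.translated:
            continue
        for target in languages:
            if target == msg.lang:
                continue
            new_messages.append(
                Message(
                    text=translate(msg.text, msg.lang, target),
                    lang=target,
                    author=TRANSLATOR_USER,
                    translated=True,
                )
            )
        # Mark the source so the next scheduled run does not translate it again.
        msg.translated = True
    return new_messages
```

A nightly job would call `translation_pass` with the stored messages and append the result. Keeping the author check separate from the `translated` flag means that even a message whose flag was lost is never re-translated if it came from the translator user.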

I need a wizard who knows the database to do that.

Anyone up for the challenge?

Feb 09 '23 16:02 ParisNeo

:x: pre-commit failed. Please run pre-commit run --all-files locally and commit the changes. Find more information in the repository's CONTRIBUTING.md

Feb 09 '23 16:02 github-actions[bot]

This PR looks very similar to the changes to the Arabic localizations. Would it be possible to separate the two PRs to be more focused?

Feb 10 '23 04:02 fozziethebeat

:x: pre-commit failed. Please run pre-commit run --all-files locally and commit the changes. Find more information in the repository's CONTRIBUTING.md

Feb 10 '23 06:02 github-actions[bot]

Sorry, this project is moving so fast. While I was working on my changes, other people updated the Arabic locale. I like their translations, so I have accepted the incoming code.

It is clear that community translations are better than any current automated translator. But if we start with the translator and then give the community the chance to improve its output, this could work, and we could converge on solid, high-quality translated text that should boost our database.

Feb 10 '23 06:02 ParisNeo

Mr

Feb 10 '23 20:02 oleole7000

Hi, can someone review my PR? I think it has been pending for a long time. The point of this PR is the automatic translation tool. The localization conflicts arose because I had another pending PR and things got a little mixed up.

Can someone tell me if my idea is worth it? Should I carry on with the rest of the plan, or should I stop?

Feb 11 '23 00:02 ParisNeo

@ParisNeo Has this solution been discussed in an issue with the team members before? Our plan was to collect manually created datasets to fine-tune a model. I am not really sure that automatic translation produces good-quality output.

In any case, if there is no issue for this suggestion, please create one, and let the discussion continue there.

Feb 12 '23 06:02 AbdBarho

Thank you for answering. I have submitted an issue on this before starting to code: #1135

Feb 12 '23 10:02 ParisNeo

Even if we aren't interested in auto-translating data, this approach could be pretty useful for translating the UI. Like a commit hook that automatically adds missing values to all the other i18n JSON files when someone adds a new string to the English version. Even if the translations aren't always good, they'd likely cut down on the churn of fixing missing values and decrease the cost of UI changes.
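As a rough sketch of the core of such a hook, assuming flat key/value locale files (nested structures are not handled), the function below fills in keys that exist in the English strings but are missing from another locale. The function name and the `translate` callable are hypothetical, and wiring it into an actual commit hook is left out.

```python
from typing import Callable, Dict


def fill_missing_translations(
    english: Dict[str, str],
    localized: Dict[str, str],
    translate: Callable[[str], str],
) -> Dict[str, str]:
    """Return a copy of `localized` with missing keys machine-translated from `english`."""
    result = dict(localized)
    for key, value in english.items():
        # Existing (possibly human-written) translations are never overwritten.
        if key not in result:
            result[key] = translate(value)
    return result
```

Because existing entries are left untouched, community corrections to a machine-translated string survive later runs.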

Feb 12 '23 18:02 bitplane

I have already done this in another project. I wrote a GitHub Action that automatically translates a CSV file containing the data I was adding into other languages. Even if this is not perfect, it is helpful: I just needed to change one or two things manually and my whole application was ready to work in multiple languages.

Feb 12 '23 19:02 ParisNeo

But I still think the idea of translating prompts and answers into other languages may be interesting. I had this idea when I was adding French prompts. I said to myself that I should also do this in English, as I found the prompt I had written interesting whatever the language. I also saw that Arabic was lagging behind, as there were not many prompts and most of them were written by people from the Emirates, so the data is really centered on the Emirates.

Not that I have anything against that, but the data in Arabic is really biased, because the Arabic speakers who happen to be part of this project do not represent the full spectrum of Arabic speakers.

In English, I find richer and more diverse data. So I said to myself that auto-translation can enrich other languages. And even though it is not perfect, people will be able to rate the prompts, so this will self-regulate over time.

Feb 12 '23 19:02 ParisNeo

OK we can close it. I have another pending pull request for the automated translation part here

Feb 13 '23 14:02 ParisNeo