Open-Assistant
Automatic data translator
Context
The objective of the auto_translate module is to increase the amount of data by enriching each language with translations of data written in other languages.
It is added as a separate folder at the repository root because it belongs to neither the frontend nor the backend. It is written in Python.
MBART translator
mbart_translator.py contains a class called MBartTranslator that provides multilingual translation. It uses the MBART-50 many-to-many model from Facebook to convert text from one language to another.
I have already built the MBartTranslator class in auto_translate/mbart_translator.py. I have also added example code that shows how it translates from one non-English language to another, or from English to other languages.
The code supports 50 languages : Arabic (ar_AR), Czech (cs_CZ), German (de_DE), English (en_XX), Spanish (es_XX), Estonian (et_EE), Finnish (fi_FI), French (fr_XX), Gujarati (gu_IN), Hindi (hi_IN), Italian (it_IT), Japanese (ja_XX), Kazakh (kk_KZ), Korean (ko_KR), Lithuanian (lt_LT), Latvian (lv_LV), Burmese (my_MM), Nepali (ne_NP), Dutch (nl_XX), Romanian (ro_RO), Russian (ru_RU), Sinhala (si_LK), Turkish (tr_TR), Vietnamese (vi_VN), Chinese (zh_CN), Afrikaans (af_ZA), Azerbaijani (az_AZ), Bengali (bn_IN), Persian (fa_IR), Hebrew (he_IL), Croatian (hr_HR), Indonesian (id_ID), Georgian (ka_GE), Khmer (km_KH), Macedonian (mk_MK), Malayalam (ml_IN), Mongolian (mn_MN), Marathi (mr_IN), Polish (pl_PL), Pashto (ps_AF), Portuguese (pt_XX), Swedish (sv_SE), Swahili (sw_KE), Tamil (ta_IN), Telugu (te_IN), Thai (th_TH), Tagalog (tl_XX), Ukrainian (uk_UA), Urdu (ur_PK), Xhosa (xh_ZA), Galician (gl_ES), Slovene (sl_SI)
The model's performance is not perfect. It tends to work better when translating from or to English than between two other languages.
Other models available on Hugging Face can also be used by constructing MBartTranslator with the model path (the model is downloaded automatically).
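For readers who have not used MBART-50 before, here is a minimal sketch of what a translation call with the Hugging Face transformers library can look like. It is not the MBartTranslator class from this PR: the function names `load_model` and `translate` are hypothetical, and the sketch assumes an MBART-50 style checkpoint whose tokenizer exposes `src_lang` and language-code tokens such as `en_XX`.

```python
DEFAULT_MODEL = "facebook/mbart-large-50-many-to-many-mmt"


def load_model(model_name=DEFAULT_MODEL):
    """Return (model, tokenizer); the checkpoint is downloaded on first use."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer


def translate(text, source_lang, target_lang, model, tokenizer):
    """Translate text between two MBART-50 language codes, e.g. fr_XX to en_XX."""
    # Tell the tokenizer which language the input is written in.
    tokenizer.src_lang = source_lang
    encoded = tokenizer(text, return_tensors="pt")
    # Forcing the first generated token to the target language code
    # is how MBART-50 selects the output language.
    generated = model.generate(
        **encoded,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_lang),
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```

Usage would be along the lines of `model, tokenizer = load_model()` followed by `translate("Bonjour tout le monde", "fr_XX", "en_XX", model, tokenizer)`. Passing a different model name to `load_model` only works if that checkpoint follows the same language-code convention.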
TODO
There are still steps to complete before this module works end to end.
The interesting part is what comes next. I am not yet familiar with the database structure, so I am asking whether someone can do this in my place:
We create a special user called translator. This user is the author of every prompt or answer created through translation, and it should be excluded from the leaderboard.
Before translating a text, the translator user verifies that the text was not written by the translator itself. Otherwise we can fall into an infinite loop.
So it scans the database for messages that have not been translated yet, translates each one into all other languages using the code I've added, then posts each translation with the translator user as the author.
This should be executed periodically, for example every night. Because translations are posted under the translator user, we can tell what was written by real users and what was written by the translator. Translations can be critiqued by people using the usual rating tools.
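The plan above can be sketched in memory as follows. This is only an illustration under stated assumptions: the `Message` fields, the `source_id` link back to the original message, and the `translate_pending` function are all hypothetical and do not reflect the real Open-Assistant database schema. `translate_fn` stands in for whatever translator is used.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

TRANSLATOR_USER = "translator"  # hypothetical username for the special user


@dataclass
class Message:
    id: int
    author: str
    lang: str
    text: str
    source_id: Optional[int] = None  # id of the original message, if translated


def translate_pending(
    messages: List[Message],
    languages: List[str],
    translate_fn: Callable[[str, str, str], str],
) -> List[Message]:
    """Return the new translated messages that one periodic run would post."""
    # (original id, language) pairs that already have a translation.
    done = {(m.source_id, m.lang) for m in messages if m.source_id is not None}
    next_id = max((m.id for m in messages), default=0) + 1
    new_messages = []
    for msg in messages:
        # Never translate the translator's own output: this prevents the loop.
        if msg.author == TRANSLATOR_USER:
            continue
        for lang in languages:
            if lang == msg.lang or (msg.id, lang) in done:
                continue
            text = translate_fn(msg.text, msg.lang, lang)
            new_messages.append(
                Message(next_id, TRANSLATOR_USER, lang, text, source_id=msg.id)
            )
            next_id += 1
    return new_messages
```

Running the function a second time over the old and new messages together returns nothing, which is the property the nightly job needs.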
I need a wizard who knows the database to do that.
Anyone up for the challenge?
:x: pre-commit failed.
Please run pre-commit run --all-files locally and commit the changes.
Find more information in the repository's CONTRIBUTING.md
This PR looks very similar to the changes to the Arabic localizations. Would it be possible to separate the two PRs to be more focused?
Sorry, this project is moving so fast. While I was working on my changes, other people updated the Arabic locale. I like their translations, so I have accepted the incoming code.
It is clear that the community's translations are better than any current automated translator. But if we start with the translator and then give the community the possibility to improve its output, this could work, and we could converge on solid, high-quality translated text that should boost our database.
Hi, can someone review my PR? I think it has been pending for a long time. The point of this PR is the automatic translation tool. The localization conflicts arose because I had another pending PR and things got a little mixed up.
Can someone tell me whether my idea is worth it? Should I carry on with the rest of the plan, or should I stop?
@ParisNeo Has this solution been discussed in an issue with the team members before? Our plan was to collect manually created datasets to fine-tune a model, and I am not sure that automatic translation produces good-quality output.
In any case, if there is no issue for this suggestion, please create one, and let the discussion continue there.
Thank you for answering. I have submitted an issue on this before starting to code: #1135
Even if we aren't interested in auto-translating data, this approach could be pretty useful for translating the UI: for example, a commit hook that automatically adds missing values to all the other i18n JSON files when someone adds a new string to the English version. Even if the translations aren't always good, they'd likely cut down on the churn of fixing missing values and decrease the cost of UI changes.
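A rough sketch of the core of such a hook, assuming flat (non-nested) key/value locale files. The function name `fill_missing` is hypothetical, and `translate_fn` stands in for any translator, such as the one proposed in this PR.

```python
from typing import Callable, Dict


def fill_missing(
    english: Dict[str, str],
    locale: Dict[str, str],
    translate_fn: Callable[[str], str],
) -> Dict[str, str]:
    """Return a copy of locale with every key missing from it machine-translated."""
    merged = dict(locale)
    for key, value in english.items():
        # Existing human translations are never overwritten.
        if key not in merged:
            merged[key] = translate_fn(value)
    return merged
```

A hook would load the English file and each other locale file with `json.load`, call `fill_missing`, and write the result back. Because existing entries are left alone, community fixes to a machine-translated string survive later runs.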
I have already done this in another project. I wrote a GitHub Action that automatically translates a CSV file containing the strings I was adding into other languages. Even though the result is not perfect, it is helpful: I only needed to change one or two things manually and my whole application was ready to work in multiple languages.
But I still think the idea of translating prompts and answers into other languages may be interesting. I had this idea when I was adding French prompts. I said to myself that I should also add them in English, because I found the prompts I had written interesting whatever the language. I also saw that Arabic was lagging behind: there were not many prompts, and most of them were written by people from the Emirates, so the data is centered on the Emirates.
Not that I have anything against that, but the Arabic data is biased, because the people who speak Arabic and take part in this project do not represent the full spectrum of the Arabic-speaking world.
In English, I find the data richer and more diverse. So I said to myself that auto-translation can enrich other languages. And even though it is not perfect, people will be able to rate the prompts, so this will self-regulate over time.
OK we can close it. I have another pending pull request for the automated translation part here