
Multilingual IR with Machine-Translated FAQ

Open stedomedo opened this issue 4 years ago • 9 comments

Building multilingual models (zero-shot, transfer learning, etc.) takes time.

So, in the meantime, as stated in #2, we could machine-translate the FAQs from English into other languages and add them to the search cluster, so that they can be retrieved for foreign-language input. The background translations don't need to be perfect, just sufficient for retrieval (adequacy before fluency/grammar).

TODOs:

  • [x] Scrape the English FAQ from data/scrapers repo
  • [x] Build machine-translator tool (e.g. with https://pypi.org/project/googletrans/); see the sketch after this list
  • [ ] Translate some samples to check quality
  • [ ] Translate all English FAQ
  • [ ] Add data to the Elasticsearch cluster (ESC) with columns: language, original_english_doc, is_machine_translated
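
A minimal sketch of such a translator tool with googletrans (note the reliability caveat mentioned further down the thread; the FAQ file path and the question/answer column names are assumptions, and the added columns follow the last TODO item):

```python
# Minimal sketch, assuming the FAQ CSV has "question" and "answer" columns;
# file paths are placeholders.
import pandas as pd
from googletrans import Translator

translator = Translator()

def mt(text, dest):
    """Machine-translate a single string from English."""
    return translator.translate(text, src="en", dest=dest).text

def translate_faq(csv_path, dest="ar"):
    df = pd.read_csv(csv_path)
    # Columns from the TODO list above
    df["language"] = dest
    df["is_machine_translated"] = True
    df["original_english_doc"] = df["question"] + "\n" + df["answer"]
    df["question"] = df["question"].apply(mt, dest=dest)
    df["answer"] = df["answer"].apply(mt, dest=dest)
    return df

translate_faq("data/faqs/faq_covidbert.csv").to_csv(
    "data/faqs/MT_ar_faq_covidbert.csv", index=False)
```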

stedomedo avatar Mar 22 '20 12:03 stedomedo

Great idea, @stedomedo! Did I get this right that we would still need language-specific models for question similarity with this approach?

Would it be an alternative to translate the user question to English on the fly and then do the matching against our FAQs? With that approach, we could easily leverage English models for question similarity.

tholor avatar Mar 22 '20 12:03 tholor

Yes, that's an option. Translation quality could suffer for queries, though, since they are so short. I'm currently exploring translation quality. Thanks!

stedomedo avatar Mar 22 '20 12:03 stedomedo

The googletrans lib does not work reliably, so I made a free trial account on MS Azure, also because they offer up to 2M characters of translation for free per month.

Here is the English FAQ data including columns for Arabic: https://github.com/stedomedo/COVID-QA/blob/auto_translators/data/faqs/MT_ar_faq_covidbert.csv

stedomedo avatar Mar 22 '20 14:03 stedomedo

And the MS translator: https://github.com/stedomedo/COVID-QA/blob/auto_translators/data/translators/ms_translate.py
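
For illustration, here is a minimal sketch of a call to the Azure Translator v3 REST API; the linked ms_translate.py is the actual implementation, and the subscription key and region below are placeholders:

```python
# Minimal sketch of the Azure Translator v3 REST API; key/region are placeholders.
import uuid
import requests

AZURE_KEY = "<your-subscription-key>"
AZURE_REGION = "<your-resource-region>"
ENDPOINT = "https://api.cognitive.microsofttranslator.com/translate"

def translate(texts, src="en", dest="ar"):
    """Translate a list of strings from src to dest via MS Translator."""
    params = {"api-version": "3.0", "from": src, "to": dest}
    headers = {
        "Ocp-Apim-Subscription-Key": AZURE_KEY,
        "Ocp-Apim-Subscription-Region": AZURE_REGION,
        "Content-Type": "application/json",
        "X-ClientTraceId": str(uuid.uuid4()),
    }
    body = [{"text": t} for t in texts]
    response = requests.post(ENDPOINT, params=params, headers=headers, json=body)
    response.raise_for_status()
    return [item["translations"][0]["text"] for item in response.json()]

print(translate(["What are the symptoms of COVID-19?"]))
```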

MS Translator is supposed to be quite good for Arabic. For other languages, Google or DeepL are better options (afaik they don't offer free credits)

Checking which real-time translation option is best to use, incl. budget-wise.

stedomedo avatar Mar 22 '20 14:03 stedomedo

@tholor @Timoeller I have a question about the (desired) search workflow. Is it: user query -> match query to question with BERT -> search with elastic (tf-idf, BM25)?

So could a multilingual workflow look like this: query -> detect lang -> if EN: match query to question with BERT -> search with elastic (tf-idf, BM25); if AR: search directly with elastic (tf-idf, BM25)? (Sketched below.) In that case, no multilingual BERT, other-language BERT, or real-time translation would be needed.
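
A minimal sketch of that branching; the two search helpers are hypothetical stand-ins for the existing backend components, and only the language detection call is a real library (pip install langdetect):

```python
# Minimal sketch of the proposed branching workflow.
from langdetect import detect  # pip install langdetect

def match_question_with_bert(query):
    # Placeholder: map the query to the most similar English FAQ
    # question via the sentence-BERT similarity model.
    return query

def search_elastic(query, lang):
    # Placeholder: tf-idf / BM25 retrieval against the Elasticsearch
    # index, filtered to documents in `lang`.
    return []

def answer(query):
    lang = detect(query)  # e.g. "en" or "ar"
    if lang == "en":
        # EN: refine the query via BERT question similarity first
        return search_elastic(match_question_with_bert(query), lang="en")
    # Other languages: query the machine-translated FAQs directly
    return search_elastic(query, lang=lang)
```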

stedomedo avatar Mar 22 '20 14:03 stedomedo

Good points.

Can you create a PR with the translation and the script for doing so? I would merge it to have this functionality in the repo.

About the language detection and the switch between BERT + ES and ES only: we could implement it this way if multilingual models aren't working well for other languages.

Do you have experience with language detection and could you write a script for this, so we can integrate it into the backend? We need language detection there anyway, because we want to adjust output texts like "source", "category", etc. The script should be rather efficient, since it directly affects response time...
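
For reference, a minimal sketch with langdetect (one option; a fastText lid.176 model would likely be faster if response time becomes an issue):

```python
# Minimal language detection helper (pip install langdetect).
from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make detection deterministic across runs

def detect_language(text, default="en"):
    """Return an ISO 639-1 code such as 'en' or 'ar'."""
    try:
        return detect(text)
    except LangDetectException:
        # Very short or non-linguistic queries can fail detection
        return default

print(detect_language("What are the symptoms?"))  # -> 'en'
print(detect_language("ما هي الأعراض؟"))          # -> 'ar'
```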

Timoeller avatar Mar 22 '20 15:03 Timoeller

One idea for "simple" transfer learning: in machine translation, this technique is commonly used for low-resource languages. Basically, you build a model for language Y on top of the model for language X by just continuing the training (1-2 epochs) with the language-Y data. The vocabularies would need to be pooled across all languages, though. This could work for small data sizes and/or maybe also for machine-translated texts.
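
A rough sketch of that continue-training idea with Hugging Face transformers, assuming a language-X checkpoint and a plain-text language-Y corpus; model name and file paths are placeholders, and the vocab-pooling step is glossed over here, since the sketch simply reuses X's tokenizer:

```python
# Minimal sketch: continue masked-LM training of a language-X model on
# language-Y text. Vocab pooling across languages is NOT handled here.
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # language X
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# One sentence per line of language-Y text (placeholder path).
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="corpus_y.txt", block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="model_y", num_train_epochs=2),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()  # the "1-2 epochs on Y data" step described above
```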

stedomedo avatar Mar 23 '20 13:03 stedomedo

That is exactly the idea! : ) With multilingual models like mBERT or XLM-R, this "zero-shot learning" is easily possible because the vocab is already in one pool for all supported languages. See e.g. Table 1 or 3 in the XLM-R paper for zero-shot transfer.

So if we train a multilingual model with Sentence-BERT on Quora, we will also be able to match all other languages - hopefully with good performance :dancer:
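
A minimal sketch of that zero-shot matching with sentence-transformers; the model name is just one example of an XLM-R-based sentence embedding model:

```python
# Minimal sketch: an Arabic query matched against English FAQ questions
# with a multilingual sentence embedding model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("stsb-xlm-r-multilingual")

faq_questions = ["What are the symptoms of COVID-19?",
                 "How does the virus spread?"]
query = "ما هي أعراض كوفيد-19؟"  # Arabic: "What are the symptoms of COVID-19?"

faq_emb = model.encode(faq_questions, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

scores = util.pytorch_cos_sim(query_emb, faq_emb)[0]
best = int(scores.argmax())
print(faq_questions[best], float(scores[best]))
```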

Timoeller avatar Mar 23 '20 18:03 Timoeller

You are probably aware of these datasets, but here's some multilingual similarity data. I have an NMT model for English->Swedish; if you want, I could machine-translate some data and add it for better performance on Scandinavian languages.

https://github.com/google-research-datasets/paws
https://www.nyu.edu/projects/bowman/xnli/

ViktorAlm avatar Mar 24 '20 13:03 ViktorAlm