Carmen Heger

Results 10 comments of Carmen Heger

One way to "easily" get multilingual data is to machine-translate. `pip install googletrans` (and then use `Translator(service_urls=["translate.google.com/gen204"])`). These are older Google Translate versions, and worse quality than prod, but it's...
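
A minimal sketch of what that looks like in code (assuming the synchronous googletrans 3.x API; the sentence and language pair are just examples):

```python
# Sketch: machine-translate a sentence with googletrans.
# The service URL mirrors the one mentioned above; quality is below the
# production Google Translate API.
from googletrans import Translator

translator = Translator(service_urls=["translate.google.com/gen204"])
result = translator.translate("How does the virus spread?", src="en", dest="de")
print(result.text)
```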

Multilingual resources can also easily be found using Linguee and checking the sources of the sentences it finds for a language pair, e.g. for DE: https://www.linguee.com/english-german/search?source=auto&query=coronavirus

For fun, I just added simple BLEU scoring to the results: https://public-mlflow.deepset.ai/#/experiments/55/runs/be5705cb1ddb4326a10f262732f5bd96 (BLEU is an n-gram-based (1-4 grams) comparison between strings; it can be done on sentence level, but with counting +1 and...
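
For reference, sentence-level BLEU with smoothing can be computed like this (my own minimal sketch with NLTK, not the scoring code behind the linked run; the sentences are made up):

```python
# Sentence-level BLEU with smoothing, using NLTK.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the patient should stay at home for two weeks".split()
hypothesis = "the patient should remain home for two weeks".split()

# Default BLEU uses 1-4 grams with equal weights; smoothing avoids a zero
# score when a higher-order n-gram has no match in the reference.
smoothie = SmoothingFunction().method1
score = sentence_bleu([reference], hypothesis, smoothing_function=smoothie)
print(f"Sentence-level BLEU: {score:.3f}")
```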

One possibility is to use PPDB to generate additional paraphrased questions: http://paraphrase.org/#/download
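
A rough sketch of how that could look, assuming the standard pipe-delimited PPDB 2.0 format (`LHS ||| phrase ||| paraphrase ||| features ||| alignment ||| entailment`) and a naive word-level substitution; the file path and question are placeholders:

```python
# Sketch: generate paraphrased questions from a PPDB download.
from collections import defaultdict

def load_ppdb(path, max_entries=100000):
    # Collect phrase -> {paraphrases} from the pipe-delimited PPDB file.
    paraphrases = defaultdict(set)
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= max_entries:
                break
            fields = line.split(" ||| ")
            if len(fields) < 3:
                continue
            phrase, paraphrase = fields[1].strip(), fields[2].strip()
            paraphrases[phrase].add(paraphrase)
    return paraphrases

def paraphrase_question(question, paraphrases):
    # Naive single-word substitution; real usage would restrict to
    # equivalence entailments and filter by the PPDB score.
    variants = []
    tokens = question.lower().split()
    for idx, tok in enumerate(tokens):
        for alt in paraphrases.get(tok, []):
            variants.append(" ".join(tokens[:idx] + [alt] + tokens[idx + 1:]))
    return variants

# Example (hypothetical file name):
# pp = load_ppdb("ppdb-2.0-s-lexical")
# print(paraphrase_question("What are the symptoms of coronavirus?", pp))
```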

Yes, that's an option. Translation quality could suffer, though, from the short query lengths. I'm currently exploring translation quality. Thanks!

The `googletrans` lib does not work reliably, so I made a free trial account on MS Azure, also because they offer up to 2M characters of translation for free per...

And the MS Translator script: https://github.com/stedomedo/COVID-QA/blob/auto_translators/data/translators/ms_translate.py MS Translator is supposed to be quite good for Arabic. For other languages, Google or DeepL are better options (afaik they don't offer free credits)...
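
For context, a bare-bones call to the Azure Translator Text REST API (v3.0) looks roughly like this; this is my own minimal example rather than the linked ms_translate.py, and the key/region are placeholders:

```python
# Sketch: translate a batch of strings via the Azure Translator Text API v3.0.
import requests

def ms_translate(texts, target_lang, subscription_key, region="westeurope"):
    url = "https://api.cognitive.microsofttranslator.com/translate"
    params = {"api-version": "3.0", "to": target_lang}
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Ocp-Apim-Subscription-Region": region,
        "Content-Type": "application/json",
    }
    body = [{"text": t} for t in texts]
    response = requests.post(url, params=params, headers=headers, json=body)
    response.raise_for_status()
    return [item["translations"][0]["text"] for item in response.json()]

# Example:
# print(ms_translate(["What are the symptoms?"], "ar", "<YOUR_KEY>"))
```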

@tholor @Timoeller I have a question about the (desired) search workflow. Is it: user query -> match query to question with BERT -> search with Elasticsearch (tf-idf, BM25)? So...
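
To make sure I read that workflow correctly, here is a rough illustration (my interpretation, not the project's actual pipeline; the model name and index are placeholders):

```python
# Sketch: user query -> BERT similarity to known questions -> BM25 search.
from sentence_transformers import SentenceTransformer, util
from elasticsearch import Elasticsearch

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # example encoder
known_questions = [
    "What are the symptoms of COVID-19?",
    "How long is the incubation period?",
]
question_embeddings = model.encode(known_questions, convert_to_tensor=True)

def search(user_query, es_index="documents"):
    # Step 1: match the user query to the closest known question with BERT.
    query_embedding = model.encode(user_query, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, question_embeddings)[0]
    best_question = known_questions[int(scores.argmax())]

    # Step 2: search Elasticsearch with that question (BM25 is the default
    # similarity; elasticsearch-py 8.x style call).
    es = Elasticsearch("http://localhost:9200")
    return es.search(index=es_index, query={"match": {"text": best_question}})
```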

One idea for "simple" transfer learning: in machine translation, [this technique](https://www.aclweb.org/anthology/W18-6325/) is commonly used when you have a low-resource language. Basically, you build a model for language Y...
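
As a very rough sketch of that idea with Hugging Face transformers (assumed details, not the linked paper's setup; the checkpoint, dataset, and hyperparameters are placeholders):

```python
# Sketch: start from a model trained on a high-resource pair (the "parent"),
# then fine-tune the same weights on the low-resource pair (the "child").
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

parent_checkpoint = "Helsinki-NLP/opus-mt-en-de"  # example high-resource parent
tokenizer = AutoTokenizer.from_pretrained(parent_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(parent_checkpoint)

args = Seq2SeqTrainingArguments(
    output_dir="child-model",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
)

# `low_resource_dataset` stands in for a tokenized parallel corpus of the
# child (low-resource) language pair.
# trainer = Seq2SeqTrainer(model=model, args=args,
#                          train_dataset=low_resource_dataset,
#                          tokenizer=tokenizer)
# trainer.train()
```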