
Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond

Open flrngel opened this issue 5 years ago • 0 comments

http://arxiv.org/abs/1812.10464

Notes

  • The model learns joint multilingual sentence representations

Pre-processing techniques

Using Moses tools:

  • punctuation normalization
  • removal of non-printing characters
  • tokenization
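
To make the three steps concrete, here is a minimal standard-library sketch. It is not the Moses implementation: `PUNCT_MAP`, the Unicode-category filter, and the regex tokenizer are simplified assumptions chosen for illustration, and the real Moses scripts apply many more language-specific rules.

```python
import re
import unicodedata

# Hypothetical, simplified replacement table (an assumption for this sketch);
# the real Moses normalizer covers many more cases.
PUNCT_MAP = {
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u2026": "...",                # horizontal ellipsis
}


def normalize_punctuation(text):
    """Map typographic punctuation to plain ASCII equivalents."""
    for src, dst in PUNCT_MAP.items():
        text = text.replace(src, dst)
    return text


def remove_non_printing(text):
    """Drop characters whose Unicode general category starts with "C"
    (control, format, unassigned, ...). Note this also drops tabs and
    newlines, so it is meant for single sentences."""
    return "".join(ch for ch in text if unicodedata.category(ch)[0] != "C")


def tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def preprocess(text):
    """Apply the three steps in the order listed in the notes."""
    return tokenize(remove_non_printing(normalize_punctuation(text)))
```

For example, `preprocess("\u201cHello\u200b, world\u201d")` normalizes the curly quotes, drops the zero-width space, and splits the result into word and punctuation tokens.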

Paper summary

  • The paper introduces a new dataset, built from the Tatoeba corpus, for multilingual similarity search
    • covers 122 languages
    • consists of 1,000 English-aligned sentence pairs for each language
  • A model trained on the sentence embeddings with English annotated data only can be transferred to other languages without any modification
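
A similarity-search evaluation on aligned pairs can be sketched as follows: embed both sides, retrieve the nearest English sentence for each foreign sentence by cosine similarity, and count mismatches. This is a rough illustration under stated assumptions, not the paper's evaluation code; the function names and the toy 3-dimensional vectors are made up, and a real run would use the model's sentence embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest(query, candidates):
    """Index of the candidate most similar to the query."""
    return max(range(len(candidates)), key=lambda i: cosine(query, candidates[i]))


def error_rate(src, tgt):
    """Fraction of src[i] whose nearest neighbour in tgt is not tgt[i].
    src[i] and tgt[i] are assumed to embed the two sides of aligned pair i."""
    wrong = sum(1 for i, q in enumerate(src) if nearest(q, tgt) != i)
    return wrong / len(src)


# Toy, made-up embeddings (assumption): three aligned pairs, with the third
# foreign vector deliberately placed closer to the first English vector.
english = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1], [0.1, 0.1, 1.0]]
other = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.0], [0.9, 0.0, 0.3]]

rate = error_rate(other, english)
```

With these toy vectors, one of the three queries retrieves the wrong English sentence, so `rate` is one third.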

flrngel, Jan 10 '19 19:01