understanding-ai
Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
http://arxiv.org/abs/1812.10464
Notes
- The model learns joint multilingual sentence representations
Pre-processing steps
using Moses tools
- punctuation normalization
- removal of non-printing characters
- tokenization
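The three Moses steps above can be sketched in plain Python. This is an illustrative stand-in, not the actual Moses scripts (which the paper invokes as Perl tools); the mappings and tokenizer rules here are simplified assumptions.

```python
import unicodedata

def normalize_punct(text: str) -> str:
    # Simplified echo of Moses' normalize-punctuation.perl:
    # map a few common typographic variants to ASCII.
    table = str.maketrans({
        "\u201c": '"', "\u201d": '"',   # curly double quotes
        "\u2018": "'", "\u2019": "'",   # curly single quotes
        "\u2013": "-", "\u2014": "-",   # dashes
        "\u00a0": " ",                  # non-breaking space
    })
    return text.translate(table)

def strip_nonprinting(text: str) -> str:
    # Drop control/format characters (Unicode categories Cc, Cf),
    # keeping ordinary whitespace.
    return "".join(
        ch for ch in text
        if ch in "\n\t " or unicodedata.category(ch) not in ("Cc", "Cf")
    )

def tokenize(text: str) -> list:
    # Crude stand-in for the Moses tokenizer: split punctuation
    # off from alphanumeric runs.
    out, word = [], []
    for ch in text:
        if ch.isalnum() or ch == "'":
            word.append(ch)
        else:
            if word:
                out.append("".join(word))
                word = []
            if not ch.isspace():
                out.append(ch)
    if word:
        out.append("".join(word))
    return out

raw = "\u201cHello,\u200b world!\u201d"   # contains a zero-width space
clean = strip_nonprinting(normalize_punct(raw))
print(tokenize(clean))  # ['"', 'Hello', ',', 'world', '!', '"']
```

In practice one would call the Moses Perl scripts (or a port such as sacremoses) rather than re-implementing these rules.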
Paper summary
- this paper introduces a new dataset, built from Tatoeba, for multilingual similarity search
- covers 112 languages
- consists of up to 1,000 English-aligned sentence pairs for each language
- enables training a classifier on top of the sentence embeddings using English-annotated data only, then transferring it to the other languages without any modification
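The zero-shot transfer idea can be shown with a toy sketch: because the encoder maps a sentence and its translations to nearby points in one shared space, a classifier fit on English vectors works on other languages as-is. The vectors and sentences below are fabricated for illustration; they stand in for real LASER embeddings.

```python
import math

# Toy shared embedding space: translations of the same sentence get
# (nearly) the same vector. All values are made up for illustration.
EMBED = {
    ("en", "the match ended two to one"):       [0.90, 0.10, 0.00],
    ("en", "stocks fell sharply today"):        [0.10, 0.90, 0.10],
    ("fr", "le match s'est termine deux a un"): [0.85, 0.15, 0.05],
    ("de", "aktien fielen heute stark"):        [0.12, 0.88, 0.08],
}

def cos(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# "Train" a nearest-centroid classifier on English data only.
train = {
    "sports":  EMBED[("en", "the match ended two to one")],
    "finance": EMBED[("en", "stocks fell sharply today")],
}

def classify(vec):
    # Pick the class whose English centroid is most similar.
    return max(train, key=lambda label: cos(train[label], vec))

# Zero-shot: the unmodified classifier applied to French/German inputs.
print(classify(EMBED[("fr", "le match s'est termine deux a un")]))  # sports
print(classify(EMBED[("de", "aktien fielen heute stark")]))         # finance
```

The same cosine-similarity machinery underlies the Tatoeba evaluation: for each sentence, retrieve the nearest neighbor in the other language and check that it is the aligned translation.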