understanding-ai
Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
http://arxiv.org/abs/1812.10464
Notes
- The model learns joint multilingual sentence representations
Pre-processing steps
using Moses tools
- punctuation normalization
- removal of non-printing characters
- tokenization
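The three Moses steps above can be sketched in plain Python. This is an illustrative stand-in, not the actual Moses scripts (which the paper invokes as Perl tools); the mappings and tokenizer rules here are simplified assumptions.

```python
import unicodedata

def normalize_punct(text: str) -> str:
    # Simplified echo of Moses' normalize-punctuation.perl:
    # map a few common typographic variants to ASCII.
    table = str.maketrans({
        "\u201c": '"', "\u201d": '"',   # curly double quotes
        "\u2018": "'", "\u2019": "'",   # curly single quotes
        "\u2013": "-", "\u2014": "-",   # dashes
        "\u00a0": " ",                  # non-breaking space
    })
    return text.translate(table)

def strip_nonprinting(text: str) -> str:
    # Drop control/format characters (Unicode categories Cc, Cf),
    # keeping ordinary whitespace.
    return "".join(
        ch for ch in text
        if ch in "\n\t " or unicodedata.category(ch) not in ("Cc", "Cf")
    )

def tokenize(text: str) -> list:
    # Crude stand-in for the Moses tokenizer: split punctuation
    # off from alphanumeric runs.
    out, word = [], []
    for ch in text:
        if ch.isalnum() or ch == "'":
            word.append(ch)
        else:
            if word:
                out.append("".join(word))
                word = []
            if not ch.isspace():
                out.append(ch)
    if word:
        out.append("".join(word))
    return out

raw = "\u201cHello,\u200b world!\u201d"   # contains a zero-width space
clean = strip_nonprinting(normalize_punct(raw))
print(tokenize(clean))  # ['"', 'Hello', ',', 'world', '!', '"']
```

In practice one would call the Moses Perl scripts (or a port such as sacremoses) rather than re-implementing these rules.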
Paper summary
- this paper introduces a new dataset, built from Tatoeba, for multilingual similarity search
- covers 112 languages
- consists of up to 1,000 English-aligned sentence pairs for each language
- enables training a classifier on top of the sentence embeddings using English-annotated data only, then transferring it to the other languages without any modification
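The zero-shot transfer idea can be shown with a toy sketch: because the encoder maps a sentence and its translations to nearby points in one shared space, a classifier fit on English vectors works on other languages as-is. The vectors and sentences below are fabricated for illustration; they stand in for real LASER embeddings.

```python
import math

# Toy shared embedding space: translations of the same sentence get
# (nearly) the same vector. All values are made up for illustration.
EMBED = {
    ("en", "the match ended two to one"):       [0.90, 0.10, 0.00],
    ("en", "stocks fell sharply today"):        [0.10, 0.90, 0.10],
    ("fr", "le match s'est termine deux a un"): [0.85, 0.15, 0.05],
    ("de", "aktien fielen heute stark"):        [0.12, 0.88, 0.08],
}

def cos(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# "Train" a nearest-centroid classifier on English data only.
train = {
    "sports":  EMBED[("en", "the match ended two to one")],
    "finance": EMBED[("en", "stocks fell sharply today")],
}

def classify(vec):
    # Pick the class whose English centroid is most similar.
    return max(train, key=lambda label: cos(train[label], vec))

# Zero-shot: the unmodified classifier applied to French/German inputs.
print(classify(EMBED[("fr", "le match s'est termine deux a un")]))  # sports
print(classify(EMBED[("de", "aktien fielen heute stark")]))         # finance
```

The same cosine-similarity machinery underlies the Tatoeba evaluation: for each sentence, retrieve the nearest neighbor in the other language and check that it is the aligned translation.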