[Question]: Stack of multi-forward, multi-backward, and multi BytePairEmbeddings gives very low cosine similarity
Question
First off, thank you for this fantastic, lean library.
I'm trying to replace bert-base-multilingual-uncased TransformerWordEmbeddings with Flair's own embeddings, as @alanakbik suggests here https://github.com/flairNLP/flair/issues/1518#issuecomment-625321847, like this:
import flair
print(flair.__version__)
from flair.embeddings import (
    StackedEmbeddings,
    FlairEmbeddings,
    BytePairEmbeddings,
    TransformerWordEmbeddings,
    WordEmbeddings,
)
from flair.data import Sentence
# https://github.com/flairNLP/flair/issues/1518#issuecomment-625321847 (2020)
# why does this give such a low cosine similarity?
embeddings = StackedEmbeddings([
    FlairEmbeddings('multi-forward'),
    FlairEmbeddings('multi-backward'),
    BytePairEmbeddings('multi'),
])
# recommended https://flairnlp.github.io/docs/tutorial-embeddings/flair-embeddings#recommended-flair-usage
# but English only
# embeddings = StackedEmbeddings(
#     [
#         WordEmbeddings("glove"),
#         FlairEmbeddings("news-forward"),
#         FlairEmbeddings("news-backward"),
#     ]
# )
import torch
print(torch.__version__)
import torch.nn as nn
cos = nn.CosineSimilarity(dim=0)
sent_en = Sentence("To take advantage of this offer please visit Istanbul Airport")
embeddings.embed(sent_en)
sent_es = Sentence("Para aprovechar esta oferta, visite el aeropuerto de Estambul")
embeddings.embed(sent_es)
# offer
en_tensor = sent_en[5].embedding
print(sent_en[5])
# oferta
es_tensor = sent_es[3].embedding
print(sent_es[3])
print(cos(en_tensor, es_tensor))
But that yields a very low similarity:
0.15.1
Setting dim=300 for multilingual BPEmb
2.9.0
Token[5]: "offer"
Token[3]: "oferta"
tensor(0.1605)
whereas the same code, but with
embeddings = TransformerWordEmbeddings('bert-base-multilingual-uncased', fine_tune=False, layers='all')  # optionally use_scalar_mix=True
yields:
0.15.1
2.9.0
Token[5]: "offer"
Token[3]: "oferta"
tensor(0.7702)
Hello @wildfluss, that's a good question. There are likely two factors at play here. One very important factor is the amount of multilingual text data used to train these models. You could try some examples with a bigger transformer-based model such as xlm-roberta-large, which I would recommend for such use cases (see the sketch at the end of this reply). The multilingual Flair models are older and RNN-based, and so haven't been trained on as much data as the transformer models.
The other factor is that the Flair models are character-level, whereas the transformer-based models work at the subword level. Character-level models likely learn more language-specific, surface-level features, which probably don't transfer too well across languages.
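For reference, here is a minimal sketch of the xlm-roberta-large suggestion, reusing the comparison code from the question; fine_tune=False and layers='all' are simply carried over from the mBERT call above, as one reasonable configuration rather than a definitive setup:

import torch.nn as nn

from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

# Swap the stacked Flair/BPE embeddings for a larger multilingual transformer.
# fine_tune=False and layers='all' mirror the mBERT call in the question.
embeddings = TransformerWordEmbeddings('xlm-roberta-large', fine_tune=False, layers='all')

cos = nn.CosineSimilarity(dim=0)

sent_en = Sentence("To take advantage of this offer please visit Istanbul Airport")
sent_es = Sentence("Para aprovechar esta oferta, visite el aeropuerto de Estambul")
embeddings.embed(sent_en)
embeddings.embed(sent_es)

# compare "offer" (token 5, en) with "oferta" (token 3, es)
print(cos(sent_en[5].embedding, sent_es[3].embedding))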