[Question]: Stack of multi-forward, multi-backward, and multi BytePairEmbeddings gives very low cosine similarity
Question
First off, thank you for this fantastic, lean library.
I'm trying to replace bert-base-multilingual-uncased TransformerWordEmbeddings with Flair's own embeddings, as @alanakbik suggests here https://github.com/flairNLP/flair/issues/1518#issuecomment-625321847, like this:
import flair
print(flair.__version__)
from flair.embeddings import (
    StackedEmbeddings,
    FlairEmbeddings,
    BytePairEmbeddings,
    TransformerWordEmbeddings,
    WordEmbeddings,
)
from flair.data import Sentence
# https://github.com/flairNLP/flair/issues/1518#issuecomment-625321847 (2020)
# why does this give such a low cosine similarity?
embeddings = StackedEmbeddings([
    FlairEmbeddings('multi-forward'),
    FlairEmbeddings('multi-backward'),
    BytePairEmbeddings('multi'),
])
# recommended https://flairnlp.github.io/docs/tutorial-embeddings/flair-embeddings#recommended-flair-usage
# but English only
# embeddings = StackedEmbeddings(
#     [
#         WordEmbeddings("glove"),
#         FlairEmbeddings("news-forward"),
#         FlairEmbeddings("news-backward"),
#     ]
# )
import torch
print(torch.__version__)
import torch.nn as nn
cos = nn.CosineSimilarity(dim=0)
sent_en = Sentence("To take advantage of this offer please visit Istanbul Airport")
embeddings.embed(sent_en)
sent_es = Sentence("Para aprovechar esta oferta, visite el aeropuerto de Estambul")
embeddings.embed(sent_es)
# offer
en_tensor = sent_en[5].embedding
print(sent_en[5])
# oferta
es_tensor = sent_es[3].embedding
print(sent_es[3])
print(cos(en_tensor, es_tensor))
But that yields a very low similarity:
0.15.1
Setting dim=300 for multilingual BPEmb
2.9.0
Token[5]: "offer"
Token[3]: "oferta"
tensor(0.1605)
whereas the same code, but with
embeddings = TransformerWordEmbeddings('bert-base-multilingual-uncased', fine_tune=False, layers='all')  # optionally use_scalar_mix=True
yields:
0.15.1
2.9.0
Token[5]: "offer"
Token[3]: "oferta"
tensor(0.7702)
Hello @wildfluss, that's a good question. There are likely two factors at play here. One very important factor is the amount of multilingual text data used to train these models. You could try some examples with a bigger transformer-based model such as xlm-roberta-large, which I would recommend for such use cases (see the sketch at the end of this reply). The multilingual Flair models are older and RNN-based, and so haven't been trained on as much data as the transformer models.
The other factor is that the Flair models are character-level, whereas the transformer-based models work at the subword level. Character-level models likely learn more language-specific, surface-level features, which probably don't transfer too well across languages.
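For reference, here is a minimal sketch of the xlm-roberta-large suggestion, reusing the comparison code from the question; fine_tune=False and layers='all' are simply carried over from the mBERT call above, as one reasonable configuration rather than a definitive setup:

import torch.nn as nn

from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

# Swap the stacked Flair/BPE embeddings for a larger multilingual transformer.
# fine_tune=False and layers='all' mirror the mBERT call in the question.
embeddings = TransformerWordEmbeddings('xlm-roberta-large', fine_tune=False, layers='all')

cos = nn.CosineSimilarity(dim=0)

sent_en = Sentence("To take advantage of this offer please visit Istanbul Airport")
sent_es = Sentence("Para aprovechar esta oferta, visite el aeropuerto de Estambul")
embeddings.embed(sent_en)
embeddings.embed(sent_es)

# compare "offer" (token 5, en) with "oferta" (token 3, es)
print(cos(sent_en[5].embedding, sent_es[3].embedding))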