
Using Sentence Transformer Models such as LaBSE or Huggingface MuRIL

Open reichenbch opened this issue 3 years ago • 1 comments

How can I use these embedding models in Top2Vec as pretrained embeddings? Since the library supports sentence-transformers, how can I use them?

Also, how can I use other Hugging Face models for embedding generation? If I download a pretrained model and write a callable function that generates embeddings, can I use it as the embedding model?

reichenbch avatar May 14 '22 15:05 reichenbch

Hi @reichenbch

I have LaBSE working, although I'm not sure it's the most efficient method. I append a LaBSE label to sbert_models (in top2vec.py) and then:

from top2vec import Top2Vec
model = Top2Vec(documents, embedding_model='LaBSE', use_embedding_model_tokenizer=True)

I have not tried MuRIL. Perhaps append a MuRIL label to 'use_models', plus the equivalent entry in 'use-urls'? This also looks promising:

from sentence_transformers import SentenceTransformer
SentenceTransformer('google/muril-base-cased')

Thus, I'm guessing something like this may also work:

from top2vec import Top2Vec
model = Top2Vec(documents, embedding_model='google/muril-base-cased', use_embedding_model_tokenizer=True)

iamdank avatar May 27 '22 05:05 iamdank

Top2Vec allows embedding_model to be a string or a callable. So currently, if your model of choice is not among the string options, you can just pass it as a callable.

ddangelov avatar Nov 13 '22 21:11 ddangelov
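
For anyone landing here later, a rough sketch of the callable route described above. It assumes the callable receives a list of strings and should return a 2-D array with one row per document; that contract is my reading and is not spelled out in this thread. The helper names make_embedding_callable and build_top2vec_with_sbert are made up for illustration, and 'sentence-transformers/LaBSE' is only an example model name; 'google/muril-base-cased' could be swapped in the same way, untested.

```python
from typing import Callable, List, Sequence

import numpy as np


def make_embedding_callable(
    encode: Callable[[List[str]], Sequence]
) -> Callable[[List[str]], np.ndarray]:
    """Wrap any encoder so it maps a list of strings to a 2-D float array.

    Hypothetical helper: the "list of strings in, one row per document out"
    contract is an assumption about what Top2Vec expects from a callable.
    """

    def embed(documents: List[str]) -> np.ndarray:
        vectors = np.asarray(encode(list(documents)), dtype=np.float32)
        # Fail early if the encoder did not return one vector per document.
        if vectors.ndim != 2 or vectors.shape[0] != len(documents):
            raise ValueError("expected one embedding row per document")
        return vectors

    return embed


def build_top2vec_with_sbert(documents, model_name="sentence-transformers/LaBSE"):
    """Hypothetical helper: train Top2Vec with a sentence-transformers model."""
    # Heavy imports are deferred so the wrapper above can be used without them.
    from sentence_transformers import SentenceTransformer
    from top2vec import Top2Vec

    sbert = SentenceTransformer(model_name)
    embed = make_embedding_callable(
        lambda docs: sbert.encode(docs, show_progress_bar=False)
    )
    # Pass the callable instead of one of the built-in string options.
    return Top2Vec(documents, embedding_model=embed)
```

Calling build_top2vec_with_sbert(documents) should then train without editing sbert_models in top2vec.py. The same wrapper can take any other function that turns a list of strings into vectors, for example one built on a plain Hugging Face model with your own pooling.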