
Is it possible to use separate word and document embedding models?

Open sliedes opened this issue 1 year ago • 6 comments

I'm wondering if it's possible to use separate models for word and document embeddings with BERTopic. Does something break if I pass it an embedding model that treats words and documents differently (essentially, embeds them in a different space, same or different dimensionality)? Let's say my documents are all more than one word, so the embedder could make the distinction. I see a few possible answers:

  1. Yes, it's possible and makes sense.
  2. It would possibly make sense, but not currently implemented (may be easy to modify BERTopic to do so, though—I'm not afraid!)
  3. No, and it's fundamentally not possible due to BERTopic requiring words and documents to be embedded in the same space (why?)

Do I understand correctly that word embeddings are essentially only used for the topic representation?

My motivation:

I'm working on a corpus of documents from a specific context. They contain things like acronyms and identifiers that are not commonly known. I would love to fine-tune some BERT model to give me customized word embeddings that capture the relationships of the vocabulary in my context.

sliedes avatar Feb 01 '24 09:02 sliedes

> I'm wondering if it's possible to use separate models for word and document embeddings with BERTopic. Does something break if I pass it an embedding model that treats words and documents differently (essentially, embeds them in a different space, same or different dimensionality)? Let's say my documents are all more than one word, so the embedder could make the distinction. I see a few possible answers:

No, and this has nothing to do with BERTopic specifically but with the nature of embeddings. Comparing embeddings that live in different dimensional spaces is generally not possible (with a few exceptions) and would require significant changes to how embeddings are constructed and compared. For instance, try comparing embeddings of different dimensionalities and you will notice the difficulties. Moreover, try passing embeddings with different dimensions to UMAP and note what happens.
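For illustration, here is a minimal sketch of the problem (the dimensionalities and random vectors below are arbitrary stand-ins for a "word" embedding and a "document" embedding):

```python
import numpy as np

# Arbitrary stand-ins: a 384-dim "word" embedding and a 768-dim "document" embedding
word_embedding = np.random.rand(384)
doc_embedding = np.random.rand(768)

# Cosine similarity requires vectors of equal length; the dot product below
# raises a ValueError because the shapes (384,) and (768,) do not align.
try:
    cosine = word_embedding @ doc_embedding / (
        np.linalg.norm(word_embedding) * np.linalg.norm(doc_embedding)
    )
except ValueError as err:
    print(f"Cannot compare embeddings of different dimensionality: {err}")
```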

> I'm working on a corpus of documents from a specific context. They contain things like acronyms and identifiers that are not commonly known. I would love to fine-tune some BERT model to give me customized word embeddings that capture the relationships of the vocabulary in my context.

Fine-tuning a BERT model generally does not result in the best embedding-based representations. I advise using SBERT instead and generating sentence-level representations. That way, you do not need to use both word and document embeddings.
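A minimal sketch of what that looks like (the model name "all-MiniLM-L6-v2" and the 20 newsgroups dataset are just placeholders; any SBERT model, including one fine-tuned on your own domain, can be passed in the same way):

```python
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

# Placeholder corpus; replace with your own documents
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# Any SBERT model works here; a domain fine-tuned model is passed the same way
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

topic_model = BERTopic(embedding_model=embedding_model)
topics, probs = topic_model.fit_transform(docs)
```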

MaartenGr avatar Feb 01 '24 13:02 MaartenGr

Yes, I know this. I guess my question was, does BERTopic actually compare word and document embeddings. Apparently the answer is "yes"? :)

sliedes avatar Feb 01 '24 13:02 sliedes

No, in principle BERTopic does not compare word and document embeddings. What happens is the following. Any textual input, whether words or documents, is embedded by the embedding model into the same dimensional space; we generally assume that the input consists of documents. These embeddings are then passed to the dimensionality reduction algorithm before being clustered. After clustering, the default setting is not to use the previous embeddings at all but to use the c-TF-IDF algorithm instead. Therefore, word and document embeddings are not really compared to one another, unless you count the fact that the same embedding model, applied to either a "word" or a "document", produces the same kind of representation.
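As a rough sketch of that pipeline and where each step sits (the component choices and parameter values below are illustrative, not prescriptions):

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

# 1. One embedding model for all textual input (assumed to be documents)
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# 2. Dimensionality reduction on the document embeddings
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")

# 3. Clustering of the reduced embeddings
hdbscan_model = HDBSCAN(min_cluster_size=10, metric="euclidean", prediction_data=True)

# 4. Topic representation via c-TF-IDF on token counts, not on embeddings
vectorizer_model = CountVectorizer(stop_words="english")

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
)
```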

Also, it might be worthwhile to check out the description of BERTopic's underlying algorithm in the documentation.

The comparison between word and document embeddings is applied in KeyBERTInspired, for example.
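For example, something along these lines (a minimal sketch using KeyBERTInspired's default settings):

```python
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

# KeyBERTInspired embeds candidate topic words and representative documents
# with the same embedding model and compares them to re-rank the topic words.
representation_model = KeyBERTInspired()
topic_model = BERTopic(representation_model=representation_model)
```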

MaartenGr avatar Feb 01 '24 13:02 MaartenGr