BERTopic

Composite seed words (is it supported by default or at least possible without modifications?)

Open GeorgeDeac opened this issue 1 year ago • 3 comments

Is it possible to use composite (multi-word) seed words? For instance, "body dysphoria": "dysphoria" by itself can mean multiple things, but when associated with "body" it is closer to the topic that I'm looking for.

If it is not supported by default, what would be the easiest way to implement this?

Would something like passing "body dysphoria", vectorizing it, and including it as a vector directly in the seed list work (with extra logic added to how seed words are handled)?

Or would it be possible only with a custom tokenizer rule?

GeorgeDeac, Feb 04 '24 18:02

The seed words themselves are passed in their entirety to an embedding model, so a multi-word seed such as "body dysphoria" is embedded as a whole phrase and, from that perspective, will have a significant effect on the steering of topics. For the word vectorizer, you would have to make sure that n-grams are supported if you also want to increase the seed words' c-TF-IDF values, but that is not necessary if you do not care about the weighting of the seed words themselves.

MaartenGr, Feb 05 '24 07:02

I understand, so it should work out of the box as far as the semantic meaning of the words together is concerned. However, in my case I might have one tracked topic whose seed term is "dysphoria" alone and another topic that contains "body dysphoria". Even more so for a topic seeded with "family" versus a topic seeded with "family issues". So in this regard, I suppose n-grams might be needed?

GeorgeDeac, Feb 05 '24 22:02

Yes, you would need n-grams for the representations themselves but not for the assignment of topics since that is handled automatically.

MaartenGr, Feb 08 '24 14:02
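
To make the point about representations concrete, here is a small standard-library sketch of why a unigram vocabulary cannot contain "family issues" or "body dysphoria" as terms, while a unigram-plus-bigram vocabulary can. The helper `ngram_terms` and the sample sentence are hypothetical and only mimic, in simplified form, what an n-gram word vectorizer does; this is not BERTopic's or scikit-learn's actual tokenization.

```python
import re


def ngram_terms(text, ngram_range=(1, 1)):
    """Return the set of word n-grams in `text` for every n in the inclusive range."""
    tokens = re.findall(r"[a-z]+", text.lower())
    lowest, highest = ngram_range
    terms = set()
    for n in range(lowest, highest + 1):
        for start in range(len(tokens) - n + 1):
            terms.add(" ".join(tokens[start:start + n]))
    return terms


# Hypothetical document and seed terms, taken from the examples in the thread.
doc = "Family issues and body dysphoria came up in the session"
seeds = ["family", "family issues", "dysphoria", "body dysphoria"]

unigram_vocab = ngram_terms(doc, (1, 1))
bigram_vocab = ngram_terms(doc, (1, 2))

# Only seeds present in the vocabulary can be weighted or shown as topic terms.
seeds_in_unigrams = [seed for seed in seeds if seed in unigram_vocab]
seeds_in_bigrams = [seed for seed in seeds if seed in bigram_vocab]

print(seeds_in_unigrams)
print(seeds_in_bigrams)
```

With unigrams only, "family" and "family issues" collapse onto the same vocabulary term, which is the ambiguity raised earlier in the thread; adding bigrams keeps them distinct in the representation.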