BERTopic Update _bertopic.py to fix question/ github issue #1696

Open jonaslandsgesell opened this issue 1 year ago • 3 comments

As discussed in https://github.com/MaartenGr/BERTopic/issues/1696, I provide an updated docstring to reflect that topic_model.transform(docs)[0][i] is sometimes different from topic_model.transform(docs[i])[0][0].
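
A minimal sketch of how one might check this on their own data. The helper name find_batch_vs_single_mismatches is hypothetical and not part of BERTopic. It only assumes that topic_model is a fitted model whose transform accepts either a single string or a list of strings and returns the predicted topics as the first element of its result, as in the two expressions above.

```python
from typing import Any, List, Sequence


def find_batch_vs_single_mismatches(topic_model: Any, docs: Sequence[str]) -> List[int]:
    """Return every index i where transform(docs)[0][i] != transform(docs[i])[0][0].

    Hypothetical helper for illustration; `topic_model` is assumed to be an
    already fitted model exposing a BERTopic-style `transform` method.
    """
    # Predict all documents in one call: element 0 of the result holds the topics.
    batch_topics = topic_model.transform(list(docs))[0]

    mismatches = []
    for i, doc in enumerate(docs):
        # Predict the same document on its own.
        single_topic = topic_model.transform(doc)[0][0]
        if int(batch_topics[i]) != int(single_topic):
            mismatches.append(i)
    return mismatches


# Usage, assuming `topic_model` was fitted beforehand and `docs` is a list of strings:
# differing = find_batch_vs_single_mismatches(topic_model, docs)
# print(f"{len(differing)} of {len(docs)} documents get a different topic")
```

An empty result for one set of documents does not rule out differences for another, so this is a spot check rather than a guarantee.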

Jan 03 '24 08:01 jonaslandsgesell

Thanks for this PR! Could you rephrase the following a bit:

(especially when using the HDBSCAN algorithm)

This makes it seem that this behavior occurs across many different algorithms, when in reality it is HDBSCAN-specific.

Feb 08 '24 14:02 MaartenGr

Sure! Do you have a suggestion for a specific wording?

I am currently lacking the imagination for other ways to express the fact that HDBSCAN is responsible here, while we could also have a pipeline without HDBSCAN (but with another component which may or may not behave similarly).

Feb 08 '24 15:02 jonaslandsgesell

You could do something like this: "A single document or a list of documents to predict the topic(s) for. NOTE: When using HDBSCAN, the prediction might differ depending on whether a single document or a list of documents is passed since it leverages the data points of other documents".

I think it's best to stay close to the original documentation and inner workings of HDBSCAN. Off the top of my head, I believe this and this resource are relevant.

Also, a small tip. ChatGPT works wonders for helping with these kinds of issues ;)

Feb 10 '24 19:02 MaartenGr
