BERTopic
Update _bertopic.py to address the question in GitHub issue #1696
As discussed in https://github.com/MaartenGr/BERTopic/issues/1696, this PR updates the docstring to reflect that topic_model.transform(docs)[0][i]
is sometimes different from topic_model.transform(docs[i])[0][0].
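For anyone who wants to check this on their own data, here is a minimal sketch of a comparison helper. The name find_transform_mismatches is hypothetical and not part of BERTopic; the only assumption is that the model exposes transform(documents) returning a (topics, probabilities) tuple and accepts either a single string or a list, as in the expressions above.

```python
from typing import Any, List, Sequence, Tuple


def find_transform_mismatches(
    topic_model: Any, docs: Sequence[str]
) -> List[Tuple[int, int, int]]:
    """Compare batch predictions with one-document-at-a-time predictions.

    Hypothetical helper (not a BERTopic function). Assumes `topic_model`
    has a `transform(documents)` method returning `(topics, probabilities)`.
    Returns a list of `(index, batch_topic, single_topic)` for every
    document whose two predictions differ.
    """
    # One call on the whole list: topic_model.transform(docs)[0]
    batch_topics, _ = topic_model.transform(list(docs))

    mismatches = []
    for i, doc in enumerate(docs):
        # One call per document: topic_model.transform(docs[i])[0][0]
        single_topics, _ = topic_model.transform(doc)
        batch_topic = int(batch_topics[i])
        single_topic = int(single_topics[0])
        if batch_topic != single_topic:
            mismatches.append((i, batch_topic, single_topic))
    return mismatches
```

Calling find_transform_mismatches(topic_model, docs) on a fitted model returns an empty list when both ways of calling transform agree for every document. Note that it runs transform once per document, so it can be slow on large inputs.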
Thanks for this PR! Could you rephrase the following a bit:
(especially when using the HDBSCAN algorithm)
This makes it seem that this behavior occurs across many different algorithms, when in reality it is HDBSCAN-specific.
Sure! Do you have a suggestion for a specific wording?
I am currently lacking the imagination for other ways to express that HDBSCAN is responsible here, given that a pipeline could also run without HDBSCAN (using another component that may or may not behave similarly).
You could do something like this: "A single document or a list of documents to predict the topic(s) for. NOTE: When using HDBSCAN, the prediction might differ depending on whether a single document or a list of documents is passed since it leverages the data points of other documents".
I think it's best to stay close to the original documentation and inner workings of HDBSCAN. Off the top of my head, I believe this and this resource are relevant.
Also, a small tip. ChatGPT works wonders for helping with these kinds of issues ;)