zilch42
I thought about that, but your KeyBERT example doesn't use `min_df` at all. Every topic should have words in that example. I wonder if it is that there are docs...
Ok, I've figured it out. The docs inside BERTopic get cleaned internally by `_preprocess_text()` before being tokenized, so by creating a vocabulary outside of BERTopic, even if it is created...
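For anyone hitting the same thing, here's a minimal sketch of the mismatch. The `clean()` helper below is only my approximation of what the internal cleaning does (swap newlines/tabs for spaces, drop non-alphanumeric characters), not the actual implementation:

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def clean(doc: str) -> str:
    # Assumption: roughly mimics BERTopic's internal _preprocess_text() for
    # English docs - newlines/tabs become spaces, and anything that isn't
    # alphanumeric or a space is dropped.
    doc = doc.replace("\n", " ").replace("\t", " ")
    return re.sub(r"[^A-Za-z0-9 ]+", "", doc)

docs = ["e-mail me at\tfoo@bar.com", "a plain text doc"]

# A vocabulary built on the raw docs contains tokens like "mail" and "foo"...
raw_vocab = set(CountVectorizer().fit(docs).get_feature_names_out())

# ...which can never match once the cleaned text reads "email me at foobarcom".
cleaned_vocab = set(CountVectorizer().fit([clean(d) for d in docs]).get_feature_names_out())

print(raw_vocab - cleaned_vocab)  # {'mail', 'foo', 'bar', 'com'}
```

So any vocabulary term that only exists in the raw text silently disappears after cleaning, which is how you end up with topics that have no words.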
Sure, most docs in newsgroups have at least one example, but try `doc[8]`. It has a few

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
...
```
I didn't necessarily see it as a bug for my use case. I like the preprocessing, and I wouldn't want things like "couldn't" being transformed into ["couldn", "t"], which...
Thanks Maarten, I'm about to finish up for the year, but if this is still open in January I'll submit one then
+1 from me!
Thanks Maarten. That's more or less what I'm doing at the moment, except that zeroshot doesn't actually assign the probabilities, so `topic_model.probabilities_` is `nan` and I'm recalculating the zeroshot topic...
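In case it helps anyone else, a rough sketch of the kind of recalculation I mean. It assumes you kept the document `embeddings` from the embedding step and have a fitted `topic_model`; the row normalisation is just one arbitrary way to make the similarities sum to 1:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Similarity of every document embedding to every topic embedding.
# Note: topic_embeddings_ may include the outlier topic (-1) as its first row.
sims = cosine_similarity(embeddings, topic_model.topic_embeddings_)

# Clip negatives and normalise each row into a probability-like distribution.
sims = np.clip(sims, 0, None)
probs = sims / sims.sum(axis=1, keepdims=True)
```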
Thanks Maarten. I look forward to those developments. I initially had a look at what's going on in `visualize_hierarchical_documents` and couldn't make much sense of it, but if I do...
Thanks @trangdata, glad to know there is a method for getting at the data. In my mind it would be intuitive to flatten to the lowest possible level and...
Thanks @trangdata. The updated documentation is definitely clearer.