zilch42
I thought about that, but your KeyBERT example doesn't use `min_df` at all. Every topic should have words in that example. I wonder if it is that there are docs...
Ok, I've figured it out. The docs inside BERTopic get cleaned internally by `_preprocess_text()` before being tokenized, so by creating a vocabulary outside of BERTopic, even if it is created...
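For anyone hitting the same thing, here's a minimal sketch of the mismatch. The `clean()` helper below is only my approximation of what the internal cleaning does (swap newlines/tabs for spaces, drop non-alphanumeric characters), not the actual implementation:

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def clean(doc: str) -> str:
    # Assumption: roughly mimics BERTopic's internal _preprocess_text() for
    # English docs - newlines/tabs become spaces, and anything that isn't
    # alphanumeric or a space is dropped.
    doc = doc.replace("\n", " ").replace("\t", " ")
    return re.sub(r"[^A-Za-z0-9 ]+", "", doc)

docs = ["e-mail me at\tfoo@bar.com", "a plain text doc"]

# A vocabulary built on the raw docs contains tokens like "mail" and "foo"...
raw_vocab = set(CountVectorizer().fit(docs).get_feature_names_out())

# ...which can never match once the cleaned text reads "email me at foobarcom".
cleaned_vocab = set(CountVectorizer().fit([clean(d) for d in docs]).get_feature_names_out())

print(raw_vocab - cleaned_vocab)  # {'mail', 'foo', 'bar', 'com'}
```

So any vocabulary term that only exists in the raw text silently disappears after cleaning, which is how you end up with topics that have no words.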
Sure, most docs in newsgroups have at least one example, but try `doc[8]`. It has a few

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
...
```
I didn't necessarily see it as a bug for my use case. I like the preprocessing, and I wouldn't want things like "couldn't" being transformed into ["couldn", "t"], which...
Thanks Maarten, I'm about to finish up for the year, but if this is still open in January I'll submit one then
+1 from me!
Thanks Maarten. That's more or less what I'm doing at the moment, except that zeroshot doesn't actually assign the probabilities, so `topic_model.probabilities_` is `nan` and I'm recalculating the zeroshot topic...
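In case it helps anyone else, a rough sketch of the kind of recalculation I mean. It assumes you kept the document `embeddings` from the embedding step and have a fitted `topic_model`; the row normalisation is just one arbitrary way to make the similarities sum to 1:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Similarity of every document embedding to every topic embedding.
# Note: topic_embeddings_ may include the outlier topic (-1) as its first row.
sims = cosine_similarity(embeddings, topic_model.topic_embeddings_)

# Clip negatives and normalise each row into a probability-like distribution.
sims = np.clip(sims, 0, None)
probs = sims / sims.sum(axis=1, keepdims=True)
```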
Thanks Maarten. I look forward to those developments. I initially had a look at what's going on in `visualize_hierarchical_documents` and couldn't make much sense of it, but if I do...
Thanks @trangdata, glad to know there is a method for getting at the data. In my mind it would be intuitive to flatten to the lowest possible level and...
Thanks @trangdata. The updated documentation is definitely clearer.