
need some ideas

Open babytdream opened this issue 10 months ago • 4 comments

I am using the BERTopic framework for my tasks, but my data is increasing daily, and I want to perform analysis periodically. Do you have any suggestions? Thank you.

babytdream avatar Feb 08 '25 02:02 babytdream

Same issue here. I'm looking for best practices for a similar use case: topic modeling on a dataset that grows over time, where I want to avoid re-tuning hyperparameters every month.

DamienBukudjian avatar Feb 10 '25 14:02 DamienBukudjian

I would generally advise using the merge_models functionality for this as it allows for training new models and iteratively merging them. This would also make it a bit more flexible for different types of models (parameter-wise) to be merged.

MaartenGr avatar Feb 11 '25 08:02 MaartenGr
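
For readers landing on this thread, the suggestion above could look roughly like the sketch below: train a fresh model on each new period of data, then fold it into a running model with `BERTopic.merge_models`. The helper names `train_batch_model` and `merge_into` are hypothetical, introduced only for illustration, and the default BERTopic settings are an assumption rather than a recommendation from the thread. The `bertopic` imports are deferred into the functions so the helpers can be defined without loading the library.

```python
from typing import Sequence


def train_batch_model(docs: Sequence[str]):
    """Fit a fresh BERTopic model on one period's documents (hypothetical helper).

    Assumes the batch is large enough for UMAP/HDBSCAN to work;
    a handful of documents will generally not be enough.
    """
    from bertopic import BERTopic  # requires `pip install bertopic`

    model = BERTopic()
    model.fit(list(docs))
    return model


def merge_into(running_model, new_model, min_similarity: float = 0.7):
    """Merge a newly trained model into the running one (hypothetical helper).

    On the first period there is nothing to merge with, so the new model
    simply becomes the running model.
    """
    if running_model is None:
        return new_model

    from bertopic import BERTopic

    # Topics in `new_model` that are not similar enough to an existing
    # topic in `running_model` are added to the merged model as new topics.
    return BERTopic.merge_models(
        [running_model, new_model], min_similarity=min_similarity
    )
```

A periodic job might then call `running = merge_into(running, train_batch_model(new_docs))` each time a new batch arrives, where `new_docs` is whatever that period's documents are. The `min_similarity` value controls how readily topics from the new batch are treated as new, so it is worth checking against your own data.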

Thank you Maarten. Do you advise any specific hyperparameters for that kind of use case? I mean, to avoid having to rework them frequently and just let the model run. It can be tricky due to the high volatility of UMAP.

DamienBukudjian avatar Feb 11 '25 09:02 DamienBukudjian

In my experience, I seldom have to change the UMAP parameters to get the kind of dimensionality reduction that I need. The only reason to do so is if the data size changes drastically (from millions to hundreds), but in those cases HDBSCAN is more finicky to control than UMAP. With HDBSCAN, it is often about tuning min_cluster_size.

MaartenGr avatar Feb 12 '25 09:02 MaartenGr
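
As a rough illustration of the point above, the sketch below keeps the UMAP settings fixed and only varies HDBSCAN's `min_cluster_size` with the number of documents. The scaling rule in `suggest_min_cluster_size` (one unit per 1,000 documents, clamped between 10 and 500) is purely an assumption made for this example, not something stated in the thread or prescribed by BERTopic; both helper names are hypothetical. The UMAP and HDBSCAN arguments mirror commonly used BERTopic settings, with `random_state` fixed so that repeated runs on the same data are reproducible.

```python
def suggest_min_cluster_size(n_docs: int, docs_per_unit: int = 1000,
                             floor: int = 10, cap: int = 500) -> int:
    """Illustrative heuristic only: scale min_cluster_size with dataset size.

    The constants are assumptions and should be tuned on real data.
    """
    return max(floor, min(cap, n_docs // docs_per_unit))


def build_topic_model(n_docs: int):
    """Build a BERTopic model with fixed UMAP and size-dependent HDBSCAN."""
    # Deferred imports; requires `pip install bertopic` (which pulls in
    # umap-learn and hdbscan).
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from umap import UMAP

    # UMAP is left alone across runs; the fixed seed removes run-to-run
    # randomness at the cost of some speed.
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                      metric="cosine", random_state=42)

    # Only the clustering granularity changes with the amount of data.
    hdbscan_model = HDBSCAN(min_cluster_size=suggest_min_cluster_size(n_docs),
                            metric="euclidean",
                            cluster_selection_method="eom",
                            prediction_data=True)

    return BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
```

For example, `build_topic_model(len(docs)).fit(docs)` would pick the cluster size from the size of the current batch. Whether a linear rule like this suits a given corpus is an open question, so treat the heuristic as a starting point for experimentation.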