
Extract information about "Super Clusters"

Open doubianimehdi opened this issue 3 years ago • 9 comments

Hi,

I have a rather unusual request. I have the following clustering: [image]

I was wondering whether I can extract information (words, distance, ...) about the "super" clusters located at 1.75 and 2.25 on the X-axis, so that, for example, researchers interested in a "super" cluster could zoom into it to get more detailed clusters, and so on until they reach the left part of the graphic. I don't know if that's clear ...

doubianimehdi avatar Oct 01 '21 07:10 doubianimehdi

That definitely seems like an interesting idea to explore a bit further.

Let me start off by saying that the hierarchical nature of the clusters that you see above is a rather simplified version compared to the actual reduction of topics as performed by HDBSCAN or manually. This means that there needs to be a fixed definition of the hierarchical nature of the topics.

However, in theory, this should be possible but is likely to require significant coding on my side. This unfortunately means that it could be a while before it is implemented.
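To make the request concrete, here is a minimal sketch of what "cutting" a hierarchy at a given distance means. It is not how BERTopic builds its hierarchy: it assumes single linkage on Euclidean distances between hypothetical 2D topic positions, and the names `cut_single_linkage` and `toy_topics` are made up for illustration.

```python
from itertools import combinations
from math import dist


def cut_single_linkage(points, threshold):
    """Return the clusters obtained by cutting a single-linkage
    hierarchy at `threshold` (illustrative helper, not a BERTopic API)."""
    names = list(points)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # Under single linkage, two items share a cluster at height `threshold`
    # exactly when a chain of pairwise distances <= threshold connects them.
    for a, b in combinations(names, 2):
        if dist(points[a], points[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical topic positions, invented for this example.
toy_topics = {
    "topic_0": (0.0, 0.0),
    "topic_1": (0.5, 0.0),
    "topic_2": (5.0, 5.0),
    "topic_3": (5.0, 5.8),
    "topic_4": (2.0, 0.0),
}

fine = cut_single_linkage(toy_topics, threshold=1.0)
coarse = cut_single_linkage(toy_topics, threshold=1.75)
print(fine)
print(coarse)
```

Raising the threshold merges more topics into each "super" cluster, which corresponds to moving from left to right along the X-axis of the dendrogram.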

Awesome idea though and I'll keep this issue open for now!

MaartenGr avatar Oct 04 '21 06:10 MaartenGr

@doubianimehdi This took me a whole lot longer than anticipated but I created a pull request with just this feature here. It allows you to create a hierarchy of topics together with their topic representations. Simply hovering, as you mentioned in your post, shows the topic representation at that particular stage.
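As a rough illustration of what a topic representation at a merged stage could look like, the sketch below pools the word counts of the child topics and keeps the top words. This is an assumption made for illustration only: plain counts stand in for whatever weighting the pull request actually uses, and `merged_representation` and `toy_counts` are hypothetical names.

```python
from collections import Counter


def merged_representation(word_counts, members, top_n=3):
    """Pool the word counts of the member topics and return the top words
    (illustrative helper, not part of BERTopic)."""
    total = Counter()
    for topic in members:
        total.update(word_counts[topic])
    # Sort by count (descending), then alphabetically so ties are deterministic.
    ranked = sorted(total.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]


# Hypothetical per-topic word counts, invented for this example.
toy_counts = {
    "topic_0": Counter({"cat": 5, "dog": 3, "pet": 2}),
    "topic_1": Counter({"dog": 4, "vet": 3, "pet": 2}),
    "topic_2": Counter({"stock": 6, "market": 4}),
}

parent_words = merged_representation(toy_counts, ["topic_0", "topic_1"])
print(parent_words)
```

Applying the same pooling at every merge gives one representation per node in the hierarchy, which is the kind of information a hover label can then display.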

The pull request describes the steps for adding this hierarchy, which should be relatively straightforward. I am still working on a couple of other tweaks in BERTopic but it should not take that long before it is merged. Fortunately, you can now already play around with it 😄

MaartenGr avatar Jun 22 '22 13:06 MaartenGr

@MaartenGr Wow! Thank you so much! That's going to be valuable! Side note: have you heard of the Doc2Map package? https://towardsdatascience.com/doc2map-travel-your-documents-like-a-walk-on-google-map-1e8b827fdc04 https://github.com/louisgeisler/Doc2Map

I've tried it, but it seems to have a bug that shows the same document on the map multiple times. Otherwise it's a really nice way to visualize the data! If you could have a look at it and maybe use it in BERTopic in the future, that would be great!

doubianimehdi avatar Jun 22 '22 13:06 doubianimehdi

I have not seen that package before. Really interesting way of approaching the visualization! I have indeed been looking at zoomable options for BERTopic, but as of right now I could not find an approach that did not need additional packages or change the API significantly. The main bottleneck seems to be zoomable visualizations that trigger when passing certain levels. There are methods for doing that, like pydeck, but they require more dependencies, unfortunately.

I will continue researching this but if you, or anyone else, has any ideas, please let me know!

MaartenGr avatar Jun 22 '22 14:06 MaartenGr

@MaartenGr Indeed, for the first three visualizations you don't need an additional package! It works with matplotlib! I think you should dig deeper into it because it's the best approach I've seen so far!

doubianimehdi avatar Jun 23 '22 22:06 doubianimehdi

> Indeed for the first three visualization , you don't need additional package ! it works with matplotlib !

Which visualizations are you referring to? From the link you posted, I see mostly interactive visualizations that cannot be created with something like matplotlib.

> I think you should dig deeper into it because it's the best approach i've seen so far !

I'll do my best but unfortunately cannot make any promises.

MaartenGr avatar Jun 24 '22 05:06 MaartenGr

[image] I'm talking about this one! It doesn't require the markleaflet applet JS.

doubianimehdi avatar Jun 24 '22 05:06 doubianimehdi

@doubianimehdi For the last few weeks, several visualization functions have been in the making that might support what you have been looking for. The main difficulty with these kinds of visualizations is the number of points that have to be plotted on the 2D plane, which is not easily supported by Plotly without additional dependencies.

Having said that... I think you might be interested in some new visualizations that I added to the pull request mentioned earlier, namely .visualize_documents and .visualize_hierarchical_documents. The former visualizes all documents on a 2D plane and the latter then creates a hierarchical view of them based on the hierarchy that was trained with BERTopic.

.visualize_documents()

[image: visualize_documents]

.visualize_hierarchical_documents()

[image: visualize_hierarchical_documents]

To perform these visualizations, I would highly advise reading through the preliminary documentation in the PR here as some optimization might be necessary to limit any RAM issues. In practice, sampling and not showing the document's content on hover typically help quite a bit.
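The sampling idea can be sketched independently of BERTopic: keep only a fraction of the documents in each topic before plotting, so that every topic stays visible while far fewer points are drawn. The helper `sample_per_topic` and the toy topic assignments below are hypothetical and only illustrate the principle; the PR documentation remains the reference for the actual options.

```python
import random
from collections import defaultdict


def sample_per_topic(topics, fraction, seed=0):
    """Return sorted indices of the documents to plot, keeping roughly
    `fraction` of each topic and at least one document per topic
    (illustrative helper, not part of BERTopic)."""
    rng = random.Random(seed)
    by_topic = defaultdict(list)
    for index, topic in enumerate(topics):
        by_topic[topic].append(index)

    kept = []
    for indices in by_topic.values():
        size = max(1, round(len(indices) * fraction))
        kept.extend(rng.sample(indices, size))
    return sorted(kept)


# Hypothetical topic assignment for 15 documents.
toy_assignments = [0] * 10 + [1] * 4 + [2] * 1
kept = sample_per_topic(toy_assignments, fraction=0.5)
print(len(kept), "of", len(toy_assignments), "documents kept")
```

Sampling per topic rather than over the whole corpus avoids small topics disappearing from the plot entirely.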

Hope this helps!

MaartenGr avatar Jun 28 '22 12:06 MaartenGr

@MaartenGr I think it might be what I've been looking for! Thank you so much! I'll let you know when it's finalised!

doubianimehdi avatar Jun 28 '22 14:06 doubianimehdi