clustergram
Can this work with clusters made by top2vec?
Thanks for your interesting package.
Do you think Clustergram could work with top2vec? https://github.com/ddangelov/Top2Vec
I saw that there is the option to create a clustergram from a DataFrame.
In top2vec, each "document" to cluster is represented as an embedding of a certain dimension, 256 for example.
So I could indeed generate a data frame, like this:
| x0 | x1 | ... | x255 | topic |
|----|----|-----|------|-------|
| 0.5 | 0.2 | ... | -0.2 | 2 |
| 0.7 | 0.2 | ... | -0.1 | 2 |
| 0.5 | 0.2 | ... | -0.2 | 3 |
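A minimal sketch of building such a frame with pandas; the random embeddings, document count, and column names are illustrative stand-ins for real top2vec output:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for top2vec output: 5 documents with 256-dimensional
# embeddings and one cluster (topic) label per document.
rng = np.random.default_rng(0)
n_docs, dim = 5, 256

embeddings = rng.normal(size=(n_docs, dim))
topics = np.array([2, 2, 3, 0, 1])

df = pd.DataFrame(embeddings, columns=[f"x{i}" for i in range(dim)])
df["topic"] = topics

print(df.shape)  # (5, 257): 256 embedding columns plus the label column
```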
Does Clustergram assume anything about the rows of this data frame? I saw that the from_data method takes either "mean" or "median" as the method to calculate the cluster centers.
With word vectors, we typically use the cosine distance to calculate distances between the vectors. Does this have any influence?
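One concrete way the distance choice matters: under cosine distance, a natural cluster "center" is the mean of the vectors re-normalised to unit length (as in spherical k-means), which differs from the plain arithmetic mean that a `method="mean"` center would give. A tiny illustration, not tied to top2vec's internals:

```python
import numpy as np

# Two unit vectors at 90 degrees.
vectors = np.array([[1.0, 0.0], [0.0, 1.0]])

# Plain arithmetic mean: shorter than a unit vector.
plain_mean = vectors.mean(axis=0)                # [0.5, 0.5]

# Cosine-style center: the mean re-normalised to unit length.
cosine_center = plain_mean / np.linalg.norm(plain_mean)

print(plain_mean, np.linalg.norm(plain_mean))    # norm ~0.707
print(cosine_center, np.linalg.norm(cosine_center))  # norm 1.0
```

So if top2vec's topic vectors are normalised means, they will not coincide with what a plain-mean center computes.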
top2vec also calculates the "topic vectors" as a mean of the "document vectors", I believe.
If I understand correctly, columns x0 ... x255 are input data while `topic` is a resulting cluster label? Then you should be able to use `Clustergram.from_data`.

Assuming you have different versions of the `topic` result, you need to create a df with as many `topic` columns as your results (ideally sorted).
> With word vectors, we typically use the cosine distance to calculate distances between the vectors. Does this have any influence?
If that means that the resulting cluster centers are not the mean/median of the values, then yes. If you know them, you can use the `Clustergram.from_centers` method instead to pass them directly.
If you can provide some minimal example, I can try to work it out.
Also note that both `from_data` and `from_centers` may be buggy :). Worth playing with them to catch and fix bugs, though.
I do indeed have the cluster centers and will try to use the `from_centers` method.
I think I could easily construct the `cluster_centers` dictionary, but I have no idea what the `labels` data frame should contain.
Let's assume I have cluster centers which have 10 dimensions, and 1, 2, and 3 clusters.
So the cluster centers dictionary should be

```python
{
    1: [[0, 0, 1, 3, 0, 5, 3, 2, 7, 8]],
    2: [[1, 0, 1, 3, 0, 5, 3, 2, 3, 8],
        [4, 0, 5, 3, 7, 5, 3, 2, 9, 8]],
    3: [[0, 0, 1, 3, 0, 5, 3, 2, 7, 8],
        [7, 1, 1, 3, 0, 5, 3, 2, 0, 8],
        [0, 0, 5, 3, 0, 5, 3, 2, 7, 8]],
}
```
Correct?
But I cannot see what the `labels` dataframe should be in this case.
Do we still need the original data as depicted above in some way as input to `from_centers`?
(I use 10 dimensions here, but the same applies to 256 dimensions.)
The `labels` dataframe contains the labelling of individual observations from different clustering options. So in the most typical case of K-Means done between 2 and 5 clusters, the first column contains labels for k=2, the second for k=3, the third for k=4 and the fourth for k=5.
|               | k=2 | k=3 | k=4 | k=5 |
|---------------|-----|-----|-----|-----|
| observation_1 | 1 | 1 | 3 | 0 |
| observation_2 | 0 | 0 | 1 | 4 |
| observation_3 | 1 | 2 | 2 | 2 |
Assuming you have a similar option in top2vec, the first column will contain labels for result A, the second for result B... From quickly looking at the code, I guess that your options will be based on different values of `min_count`?
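The labels table above can be built directly as a pandas DataFrame, one row per observation and one column per clustering option:

```python
import pandas as pd

# Rebuilding the labels table from the reply above: rows are observations,
# columns are the different clustering results (here k=2..5).
labels = pd.DataFrame(
    {
        "k=2": [1, 0, 1],
        "k=3": [1, 0, 2],
        "k=4": [3, 1, 2],
        "k=5": [0, 4, 2],
    },
    index=["observation_1", "observation_2", "observation_3"],
)

print(labels)
```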
The cluster centers dict above looks alright. You may just need to wrap each into a numpy array to get something like this:

```python
centers = {
    1: np.array([[0, 0]]),
    2: np.array([[-1, -1], [1, 1]]),
    3: np.array([[-1, -1], [1, 1], [0, 0]]),
}
```
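Putting the two pieces together for a 1-3 cluster example in 2D. This is a sketch: the assumed `Clustergram.from_centers(cluster_centers, labels)` call is left as a comment and should be checked against the clustergram docs (as noted above, this path may still be buggy). The dict keys must match the `labels` column names, and each centers array must have one row per cluster:

```python
import numpy as np
import pandas as pd

# Centers for the 1-, 2-, and 3-cluster solutions (2D toy data).
centers = {
    1: np.array([[0, 0]]),
    2: np.array([[-1, -1], [1, 1]]),
    3: np.array([[-1, -1], [1, 1], [0, 0]]),
}

# Labels for three observations under each solution; with one cluster,
# every observation gets label 0.
labels = pd.DataFrame(
    {
        1: [0, 0, 0],
        2: [0, 1, 1],
        3: [0, 1, 2],
    }
)

# Assumed call, to be verified against the clustergram documentation:
# from clustergram import Clustergram
# cgram = Clustergram.from_centers(centers, labels)
# cgram.plot()

# Sanity check: k clusters -> k rows of 2D centers.
for k, arr in centers.items():
    assert arr.shape == (k, 2)
```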
By "label" you mean "which cluster"? So I read the table above as:
"In the situation of 2 clusters, observation_1 was in cluster 1, observation_2 in cluster 0, observation_3 in cluster 1" ... In the situation of 5 clusters, observation_1 was in cluster 0, observation_2 in cluster 4, observation_3 in cluster 2.
So the table has one row for each observation, correct?
Yes, precisely.
Ok, I will give it a try.
I use top2vec to cluster 55000 documents.
The initial run of top2vec created 401 clusters, which I can "reduce to any size", which I would then do step by step, going from 401 down to 0.
So my labels table would be big: 55000 * 401.
Do you think it makes any sense to create a clustergram as big as this? Our final goal is obviously to find the "best number of clusters" from the clustergram...
Clustergram itself should deal with it, but keep in mind that you'll need to be able to interpret it. The new interactive exploration can help you with that but still, that is a lot of options to look at. Is there no assumption about the data? I.e. you normally know if you're looking for 5, 25 or 150 clusters.
As we deal with text, 55000 scientific paper abstracts, and word/paragraph vectors, any mathematical assumptions are very difficult. The vector representation of the text is so far away from the text itself that this is very tricky. The notion of "how many topics are present in a given text corpus" is not well defined; it is a continuum.
So frankly, we don't have a clue how many clusters to expect.
top2vec does something sensible and chooses a certain number of topics automatically by some internal criteria. That's one reason why we like the top2vec approach.
In that case, I'd suggest trying to get the maximum from the `bokeh()` visualisation of the clustergram so you can explore different parts of it.
@behrica did you manage to make it work by any chance?
@behrica I have the same goal as you, but I'm using BERTopic... I would be interested in seeing what you did if you managed to use it.
@doubianimehdi Can you share a reproducible example of your problem? So I could try playing with that and figure out the solution?
I have an hdbscan method with the cluster information, but I don't know how to use it in clustergram...
If you can share the code and some sample data so I can reproduce what you're doing, I can have a look at the way of using the result within a clustergram. You can check this guide on how to prepare such example - https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports
@martinfleis @doubianimehdi Our overall goal was to do (automatic) hyperparameter optimisation with top2vec. The top2vec code does not come with an implementation of a metric, so I was exploring some other forms of "cluster evaluation" and landed here.
In the meanwhile we found an implementation of a metric, so I did not explore the usage of clustergram for top2vec any further.