clustergram
Can this work with clusters made by top2vec?
Thanks for your interesting package.
Do you think Clustergram could work with top2vec? https://github.com/ddangelov/Top2Vec
I saw that there is the option to create a clustergram from a DataFrame.
In top2vec, each "document" to cluster is represented as an embedding of a certain dimension, 256 for example.
So I could indeed generate a data frame, like this:
| x0 | x1 | ... | x255 | topic |
|----|----|-----|------|-------|
| 0.5 | 0.2 | ... | -0.2 | 2 |
| 0.7 | 0.2 | ... | -0.1 | 2 |
| 0.5 | 0.2 | ... | -0.2 | 3 |
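A minimal sketch of building such a frame with pandas; the random embeddings, document count, and column names are illustrative stand-ins for real top2vec output:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for top2vec output: 5 documents with 256-dimensional
# embeddings and one cluster (topic) label per document.
rng = np.random.default_rng(0)
n_docs, dim = 5, 256

embeddings = rng.normal(size=(n_docs, dim))
topics = np.array([2, 2, 3, 0, 1])

df = pd.DataFrame(embeddings, columns=[f"x{i}" for i in range(dim)])
df["topic"] = topics

print(df.shape)  # (5, 257): 256 embedding columns plus the label column
```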
Does Clustergram assume anything about the rows of this data frame? I saw that the from_data method takes either "mean" or "median" as the method to calculate the cluster centers.
With word vectors, we typically use the cosine distance to calculate distances between the vectors. Does this have any influence?
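One concrete way the distance choice matters: under cosine distance, a natural cluster "center" is the mean of the vectors re-normalised to unit length (as in spherical k-means), which differs from the plain arithmetic mean that a `method="mean"` center would give. A tiny illustration, not tied to top2vec's internals:

```python
import numpy as np

# Two unit vectors at 90 degrees.
vectors = np.array([[1.0, 0.0], [0.0, 1.0]])

# Plain arithmetic mean: shorter than a unit vector.
plain_mean = vectors.mean(axis=0)                # [0.5, 0.5]

# Cosine-style center: the mean re-normalised to unit length.
cosine_center = plain_mean / np.linalg.norm(plain_mean)

print(plain_mean, np.linalg.norm(plain_mean))    # norm ~0.707
print(cosine_center, np.linalg.norm(cosine_center))  # norm 1.0
```

So if top2vec's topic vectors are normalised means, they will not coincide with what a plain-mean center computes.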
top2vec also calculates the "topic vectors" as a mean of the "document vectors", I believe.
If I understand correctly, columns x0 ... x255 are input data while `topic` is a resulting cluster label? Then you should be able to use `Clustergram.from_data`.

Assuming you have different versions of the `topic` result, you need to create a df with as many `topic` columns as your results (ideally sorted).
> With word vectors, we typically use the cosine distance to calculate distances between the vectors. Does this have any influence?
If that means that the resulting cluster centers are not the mean/median of the values, then yes. If you know them, you can use the `Clustergram.from_centers` method instead to pass them directly.
If you can provide some minimal example, I can try to work it out.
Also note that both `from_data` and `from_centers` may be buggy :). Worth playing with them to catch and fix bugs, though.
I do indeed have the cluster centers and will try to use the `from_centers` method.
I think I could easily construct the `cluster_centers` dictionary, but I have no idea what the `labels` data frame should contain.
Let's assume I have cluster centers which have 10 dimensions, and 1, 2, and 3 clusters.
So the cluster centers dictionary should be

```python
{
    1: [[0, 0, 1, 3, 0, 5, 3, 2, 7, 8]],
    2: [[1, 0, 1, 3, 0, 5, 3, 2, 3, 8],
        [4, 0, 5, 3, 7, 5, 3, 2, 9, 8]],
    3: [[0, 0, 1, 3, 0, 5, 3, 2, 7, 8],
        [7, 1, 1, 3, 0, 5, 3, 2, 0, 8],
        [0, 0, 5, 3, 0, 5, 3, 2, 7, 8]],
}
```
Correct?
But I cannot see what the `labels` dataframe should be in this case.
Do we still need the original data as depicted above in some way as input to `from_centers`?
(I use 10 dimensions here, but the same applies to 256 dimensions.)
The `labels` dataframe contains the labelling of individual observations from different clustering options. So in the most typical case of K-Means done between 2 and 5 clusters, the first column contains labels for k=2, the second for k=3, the third for k=4 and the fourth for k=5.
|               | k=2 | k=3 | k=4 | k=5 |
|---------------|-----|-----|-----|-----|
| observation_1 | 1 | 1 | 3 | 0 |
| observation_2 | 0 | 0 | 1 | 4 |
| observation_3 | 1 | 2 | 2 | 2 |
Assuming you have a similar option in top2vec, the first column will contain labels for result A, the second for result B... From quickly looking at the code, I guess that your options will be based on different values of `min_count`?
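The labels table above can be built directly as a pandas DataFrame, one row per observation and one column per clustering option:

```python
import pandas as pd

# Rebuilding the labels table from the reply above: rows are observations,
# columns are the different clustering results (here k=2..5).
labels = pd.DataFrame(
    {
        "k=2": [1, 0, 1],
        "k=3": [1, 0, 2],
        "k=4": [3, 1, 2],
        "k=5": [0, 4, 2],
    },
    index=["observation_1", "observation_2", "observation_3"],
)

print(labels)
```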
The cluster centers dict above looks alright. You may just need to wrap each into a numpy array to get something like this:

```python
centers = {
    1: np.array([[0, 0]]),
    2: np.array([[-1, -1], [1, 1]]),
    3: np.array([[-1, -1], [1, 1], [0, 0]]),
}
```
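Putting the two pieces together for a 1-3 cluster example in 2D. This is a sketch: the assumed `Clustergram.from_centers(cluster_centers, labels)` call is left as a comment and should be checked against the clustergram docs (as noted above, this path may still be buggy). The dict keys must match the `labels` column names, and each centers array must have one row per cluster:

```python
import numpy as np
import pandas as pd

# Centers for the 1-, 2-, and 3-cluster solutions (2D toy data).
centers = {
    1: np.array([[0, 0]]),
    2: np.array([[-1, -1], [1, 1]]),
    3: np.array([[-1, -1], [1, 1], [0, 0]]),
}

# Labels for three observations under each solution; with one cluster,
# every observation gets label 0.
labels = pd.DataFrame(
    {
        1: [0, 0, 0],
        2: [0, 1, 1],
        3: [0, 1, 2],
    }
)

# Assumed call, to be verified against the clustergram documentation:
# from clustergram import Clustergram
# cgram = Clustergram.from_centers(centers, labels)
# cgram.plot()

# Sanity check: k clusters -> k rows of 2D centers.
for k, arr in centers.items():
    assert arr.shape == (k, 2)
```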
By "label" you mean "which cluster"? So I read the table above as:
"In the situation of 2 clusters, observation_1 was in cluster 1, observation_2 in cluster 0, observation_3 in cluster 1" ... In the situation of 5 clusters, observation_1 was in cluster 0, observation_2 in cluster 4, observation_3 in cluster 2.
So the table has one row for each observation, correct?
Yes, precisely.
Ok, I will give it a try.
I use top2vec to cluster 55000 documents.
The initial run of top2vec created 401 clusters, which I can "reduce to any size", which I would then do step by step, going from 401 down to 0.
So my labels table would be big: 55000 * 401.
Do you think it makes any sense to create a clustergram as big as this? Our final goal is obviously to find the "best number of clusters" from the clustergram...
Clustergram itself should deal with it, but keep in mind that you'll need to be able to interpret it. The new interactive exploration can help you with that but still, that is a lot of options to look at. Is there no assumption about the data? I.e. you normally know if you're looking for 5, 25 or 150 clusters.
As we deal with text, 55000 scientific paper abstracts, and word/paragraph vectors, any mathematical assumptions are very difficult. The vector representation of the text is so far away from the text itself that this is very tricky. The notion of "how many topics are present in a given text corpus" is not well defined; it is a continuum.
So frankly, we don't have a clue how many clusters to expect.
top2vec does something sensible and chooses a certain number of topics automatically by some internal criteria. That's one reason why we like the top2vec approach.
In that case, I'd suggest trying to get the maximum from the `bokeh()` visualisation of the clustergram so you can explore different parts of it.
@behrica did you manage to make it work by any chance?
@behrica I have the same goal as you, but I'm using BERTopic... I would be interested in seeing what you did if you managed to use it.
@doubianimehdi Can you share a reproducible example of your problem? So I could try playing with that and figure out the solution?
I have an hdbscan method with the cluster information, but I don't know how to use it in clustergram...
If you can share the code and some sample data so I can reproduce what you're doing, I can have a look at the way of using the result within a clustergram. You can check this guide on how to prepare such example - https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports
@martinfleis @doubianimehdi Our overall goal was to do (automatic) hyperparameter optimisation with top2vec. The top2vec code does not come with an implementation of a metric, so I was exploring some other forms of "cluster evaluation" and landed here.
In the meanwhile we found an implementation of a metric, so I did not explore the usage of clustergram for top2vec any further.