BERTopic
How to generate an embedding for each document?
I would like to compute the distance between each document and the topic it belongs to, but there is no function that can be called for this.
You are correct that there is no function that calculates the distance between each document and the topic it belongs to. This is because that is not how the topic clusters are actually generated. Since we cluster the UMAP-reduced embeddings using a density-based clustering technique, the distance between a document's embedding and the centroid of a cluster is not an accurate proxy for a document's membership in that cluster.
Having said that, to calculate the distance between a document and a topic using embeddings, you need to generate the embeddings for each document yourself, as is done here, and then average the embeddings of the documents in each cluster. Then, you can use cosine similarity or the dot product to find the distance between each document and a topic. Note, though, that this is generally not advised since clusters are found using a density-based procedure; using a centroid-based approach to calculate distances will likely result in inaccurate values.
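A minimal sketch of that centroid approach, using random vectors as stand-ins for real document embeddings (in practice you would generate them yourself with, e.g., `SentenceTransformer(...).encode(docs)`):

```python
import numpy as np

def topic_centroids(embeddings, topics):
    """Average the document embeddings that belong to each topic."""
    topics = np.asarray(topics)
    return {t: embeddings[topics == t].mean(axis=0)
            for t in sorted(set(topics.tolist()))}

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random stand-ins for real document embeddings; in practice use e.g.
# SentenceTransformer("all-MiniLM-L6-v2").encode(docs).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 384))
topics = [0, 0, 1, 1, 1, 0]

centroids = topic_centroids(embeddings, topics)
sims = [cosine_sim(embeddings[i], centroids[t]) for i, t in enumerate(topics)]
```

Keep in mind the caveat above: with density-based clusters, these centroid similarities are only a rough proxy.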
Thanks for your reply. Actually, I found that the trained 'model' variable does contain topic-level embeddings, which seem to have been generated by the SentenceTransformer (since they have the same shape). I don't know what the meaning of such embeddings is. Could I use them as topic embeddings to calculate the distance? Thanks again.
Ah right, totally forgot about those 😅. Yes, you can use those to calculate the distances between documents and topics but these topic embeddings are just the unweighted average of document embeddings in a topic (i.e., the center of a topic). Do note that what I mentioned before still applies, it may not be accurate to calculate the distance between a document and the center of a topic as the center may not be that representative of the topic.
That's great. I suppose such a method can be applied to 'soft-distance' measurements like similarity, etc. Really appreciate it ~
Hi,
I am joining the discussion since I would like to extract the average of the document embeddings for each topic, in order to compare similarities between the topics. If I understood the previous discussion correctly, those can be extracted from the model, which would speed up a workflow a lot. However, I am not sure how to do this because I do not see such a function in the documentation (e.g., .get_representative_docs() can be used for extracting representative docs, but I do not see an equivalent function for embeddings). Any help would be appreciated. Thanks.
@kjaksic We can use the topic_model.topic_embeddings to get the average document embedding for each topic. Although you can use that to compare distances, I would advise going with topic_model.c_tf_idf instead. It does not assume that the center of a cluster is the best-representing sample and instead focuses on the words that make up the cluster.
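As a sketch, comparing topics pairwise works the same way whichever representation you pick: treat each topic as a row vector (whether from `topic_model.topic_embeddings` or a dense copy of `topic_model.c_tf_idf`) and compute row-wise cosine similarity. The matrix below is a toy stand-in:

```python
import numpy as np

def pairwise_cosine(matrix):
    """Cosine similarity between every pair of rows."""
    m = np.asarray(matrix, dtype=float)
    unit = m / np.linalg.norm(m, axis=1, keepdims=True)
    return unit @ unit.T

# Toy stand-in; replace with topic_model.topic_embeddings or
# topic_model.c_tf_idf.toarray() from a fitted model.
topic_matrix = np.array([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0],
                         [1.0, 1.0, 0.0]])
sims = pairwise_cosine(topic_matrix)
```

For a real c-TF-IDF matrix, which is sparse and high-dimensional, `sklearn.metrics.pairwise.cosine_similarity` on the sparse matrix directly would avoid the dense copy.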
@MaartenGr Thank you very much for the explanation and suggestion, will look into it!
@MaartenGr Is it expected behavior that the model extracted 569 topics, but the embedding matrix has a dimension of 570*384? Does this mean that the 0 index in the embedding matrix refers to the topic -1, that is, unclustered comments? Thank you.
@kjaksic
Does this mean that the 0 index in the embedding matrix refers to the topic -1, that is, unclustered comments?
Yes, the 0 index is indeed topic -1!
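A small illustration of that offset, with a toy matrix standing in for `topic_model.topic_embeddings` (4 rows for topics -1, 0, 1, 2):

```python
import numpy as np

# Stand-in for topic_model.topic_embeddings on a model with 3 topics
# plus the outlier topic -1, hence 4 rows.
topic_embeddings = np.arange(12, dtype=float).reshape(4, 3)

def embedding_for_topic(matrix, topic):
    """Row 0 holds outlier topic -1, so topic t sits at row t + 1."""
    return matrix[topic + 1]
```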
Hi @MaartenGr ,
I have extracted the average document embedding for each topic using topic_model.topic_embeddings. I also estimated the average document embedding for each topic by calculating the average of the individual document embeddings (topic_model._extract_embeddings) that form that topic. I would assume that these two procedures would lead to the same results. However, the correlation between the average document embeddings is not 1 (around .70 on average), and the correlation between topics differs slightly depending on how the centroid was extracted. I assume that the weighting of the documents is not the same across the procedures, but I would appreciate it if you have some thoughts about this. Thanks.
@kjaksic When you run topic_model.topic_embeddings what you get back is not the average document embedding for each topic. Although a topic consists of a number of documents, getting the average embedding of documents in a topic might not be an accurate representation if you are using a density-based clustering technique. Instead, we create topic representations through c-TF-IDF and weight certain words in a topic. We use these weighted words to create the topic embeddings as follows:
https://github.com/MaartenGr/BERTopic/blob/63fd2a2ea3ebdd0ac91347a13103b5fe6a1d741f/bertopic/_bertopic.py#L1591-L1602
In other words, the topic embedding is created by taking the word embeddings of the top n words in a topic. Then, we average them together but apply a weighting scheme based on the c-TF-IDF values of these words. This does mean, however, that the word embeddings take little context into account, which we assume to be mitigated by averaging over several words. In practice, it might be worthwhile to combine all words into a single sentence and create an embedding out of that, but that is for a future version.
However, the correlation between average document embedding is not 1 (around .70 on average) and the correlation between topics slightly differs depending on how the centroid was extracted. I assume that the weighting of the documents is not the same across the procedures, but would appreciate it if you have some thoughts about this. Thanks.
Since the topic embedding is not the same as the average document embedding, it is not surprising that the correlation is not perfect.
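The weighting scheme described above can be sketched as a c-TF-IDF-weighted average of word embeddings. The function below is an illustrative stand-in with toy vectors, not the exact BERTopic implementation (see the linked source for that):

```python
import numpy as np

def topic_embedding_from_words(word_embeddings, ctfidf_weights):
    """Average word embeddings, weighted by their (normalized) c-TF-IDF values."""
    w = np.asarray(ctfidf_weights, dtype=float)
    w = w / w.sum()
    return (np.asarray(word_embeddings, dtype=float) * w[:, None]).sum(axis=0)

# Two toy word vectors with c-TF-IDF weights 3 and 1:
# the result leans 3:1 toward the first word's embedding.
emb = topic_embedding_from_words([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0])
```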
@MaartenGr Thank you for such a detailed response! It is all clear.
Hi Maarten,
Since inclusion of the time component (topics over time) in the model allows the topic representation (top n words) to differ across time, this should also affect the topic embedding, and the embedding at time point t may differ from the embedding at time point t+1? Is it possible to extract topic embeddings at different timestamps if this is the case?
Thanks.
Since inclusion of the time component (topics over time) in the model allows for the topic representation (top n words) to differ across the time, this should also affect the topic embedding and embedding at time point t may differ from the embedding at the time point t+1?
In topics_over_time, we are not using the topic embeddings to create the topic representations. Instead, we use the c-TF-IDF matrix at different timestamps to extract each representation. Like you mentioned, this indeed means that the c-TF-IDF representation at time point t likely differs from the c-TF-IDF representation at time point t+1.
Hi Maarten,
Thank you for the answer. Would it be possible to extract the c-TF-IDF matrix at different time points and multiply it with the word embeddings of keywords (as the topic embedding function does for the whole model) to create topic embeddings at different timestamps? This would also require us to extract keyword embeddings, if that is possible.
Our goal is to have topic embeddings at different timestamps in order to calculate correlations between the topics and see how those change (having fixed topic embeddings across time would probably artificially inflate correlations between the topics). Since we cannot calculate average document embeddings to extract the centroid when using HDBSCAN clustering, applying the topic (word) embedding function at different timestamps seems like a possible solution.
Alternatively, would applying KMeans instead of HDBSCAN (if the model quality is not significantly reduced) allow us to average document embeddings and extract the centroid, or would distances between the clusters still not be preserved due to the UMAP application even in that case?
Thank you.
Thank you for the answer. Would it be possible to extract the c-TF-IDF matrix at different time points and multiply it with the word embeddings of keywords (as the topic embedding function does for the whole model) to create topic embeddings at different timestamps? This would also require us to extract keyword embeddings, if that is possible.
Our goal is to have topic embeddings at different timestamps in order to calculate correlations between the topics and see how those change (having fixed topic embeddings across time would probably artificially inflate correlations between the topics). Since we cannot calculate average document embeddings to extract the centroid when using HDBSCAN clustering, applying the topic (word) embedding function at different timestamps seems like a possible solution.
You can generate topics over time by following the instructions here. The way that works is by generating c-TF-IDF representations at each timestamp t by selecting a subset of the documents. It does not re-train the c-TF-IDF representation so you can compare them across timestamps.
To extract those representations, you would have to modify the code in:
https://github.com/MaartenGr/BERTopic/blob/63fd2a2ea3ebdd0ac91347a13103b5fe6a1d741f/bertopic/_bertopic.py#L406
to track and return the c_tf_idf variable at each timestamp.
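As an illustration of what tracking the c-TF-IDF variable per timestamp amounts to, the toy loop below stores one topic-term matrix per timestamp (raw counts only, without the c-TF-IDF weighting). The real change would live inside topics_over_time in _bertopic.py, appending the local c_tf_idf matrix to a collection at each iteration:

```python
import numpy as np

# Toy corpus of (text, topic, timestamp) triples. topics_over_time selects a
# document subset per timestamp before computing c-TF-IDF; here we only build
# one topic-term count matrix per timestamp to show the bookkeeping.
docs = [("cats dogs", 0, 2021), ("dogs pets", 0, 2022),
        ("stocks bonds", 1, 2021), ("bonds rates", 1, 2022)]

vocab = sorted({w for text, _, _ in docs for w in text.split()})
col = {w: i for i, w in enumerate(vocab)}
n_topics = len({t for _, t, _ in docs})

matrices = {}  # timestamp -> (n_topics, n_words) count matrix
for text, topic, ts in docs:
    m = matrices.setdefault(ts, np.zeros((n_topics, len(vocab))))
    for w in text.split():
        m[topic, col[w]] += 1
```

Because the vocabulary and topic ordering are fixed across timestamps, the stored matrices are directly comparable, which is what makes the per-timestamp correlation analysis possible.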
Alternatively, would applying KMeans instead of HDBSCAN (if the model quality is not significantly reduced) allow us to average document embeddings and extract the centroid, or would distances between the clusters still not be preserved due to the UMAP application even in that case?
The difficulty with a centroid-based technique, in both the clustering and the representation, is that the centroid might not actually represent the cluster that well. Especially with non-convex clusters, this assumption typically does not hold. Having said that, you could try averaging the documents; just be aware that these assumptions exist.
Due to inactivity, I'll be closing this for now. Let me know if you have any other questions related to this and I'll make sure to re-open the issue!
it might be worthwhile to combine all words to a single sentence and create an embedding out of that but that is for a future version.
@MaartenGr Has the "future version" been developed yet? Is the value of model.topic_embeddings_ now based on popular words or on passages?
By the way, the link here is a 404. Does it refer to this?
As is done here and then average the embeddings of the documents in each cluster.
@syGOAT The topic embeddings are created here:
https://github.com/MaartenGr/BERTopic/blob/424cefc68ede08ff9f1c7e56ee6103c16c1429c6/bertopic/_bertopic.py#L3882
This means that, initially, the topic embeddings are based on the centroid of a cluster. This might change under certain circumstances when we do not have all the embeddings of a cluster; in those cases, the topic embeddings are created from the average of word embeddings.
Note that the topic embeddings are not the only method of representing topics. I frequently use the c-TF-IDF representations as a way to also represent the topics alongside or in place of the topic embeddings.
@MaartenGr I have fully understood. Thank you for your reply!