Top2Vec
Topic Information Gain implementation
Dear Dimo,
I am really enjoying learning about the Top2Vec model and would like to master the PWI evaluation measure, since it seems to be the most informative and intuitive measure for topic evaluation. However, whenever I run my implementation of the function on the newsgroups dataset, as in your paper, I get very low values.
Below is a dummy example of my implementation evaluating 1 fake topic with the PWI evaluation measure. Any comments on where to improve are highly appreciated.
import nltk
import numpy as np
from nltk import FreqDist
from gensim.utils import tokenize

documents = [
    "The man eats bread in the street in the city of Las Vegas",
    "The chicken crosses the road in Las Vegas",
    "The earth is the only planet with humans",
    "The woman eats bread on the road, she was hungry and needed food",
    "Another planet in the Galaxy is called mars, mars ,mars",
    "Different subject",
    "Lorem Ipsum Bla Bla"
]

# Marginal probability distribution for all terms in the document set
freq_dist = FreqDist()
tokenized = [list(tokenize(s)) for s in documents]
for doc in tokenized:
    for word in doc:
        freq_dist[word.lower()] += 1

# Total number of documents
N = len(documents)
pattern = r'\w+'
# Total frequency of all terms in the document set
F = sum(len(nltk.regexp_tokenize(doc, pattern)) for doc in documents)
# Probability of document dj: uniform, 1 / number of documents
P_dj = 1 / N

# The n topic words of one fake topic to evaluate
fake_topic_words = ["earth", "planet", "the", "las", "vegas", "chicken", "bread", "eats"]

# Iterate over all documents, collecting the PWI contribution of each
# (document, topic word) pair for the total score of the topic
dj_wi_pwi_list = []
for d in documents:
    tokenized_doc = [x.lower() for x in nltk.regexp_tokenize(d, pattern)]  # Should this be a set?
    for word in fake_topic_words:
        f_ij = sum(1 for x in tokenized_doc if word == x)
        P_wi_dj = f_ij / F
        if P_wi_dj == 0:
            continue
        P_wi = freq_dist.freq(word)
        pmi = np.log(P_wi_dj / (P_wi * P_dj))
        pwi = P_wi_dj * pmi
        dj_wi_pwi_list.append(pwi)

topic_pwi = sum(dj_wi_pwi_list)
print(f"topic_pwi:{topic_pwi} = sum({dj_wi_pwi_list})")
+1. I'd also like to see an efficient implementation of PWI. I know this has been mentioned by others, as well.
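This is not the author's reference code, but one rough sketch of how the snippet above could be sped up under the same probability definitions: count each document once with a Counter instead of re-tokenizing it for every topic word. The whitespace tokenization and the names pwi and doc_counts are my own simplifications:

```python
import math
from collections import Counter

def pwi(documents, topic_words):
    # Pre-tokenize and count each document once, instead of re-tokenizing
    # for every (document, topic word) pair.
    doc_counts = [Counter(doc.lower().split()) for doc in documents]
    F = sum(sum(c.values()) for c in doc_counts)   # total token count
    corpus_counts = Counter()
    for c in doc_counts:
        corpus_counts.update(c)
    p_d = 1 / len(documents)                       # uniform document prior
    total = 0.0
    for c in doc_counts:
        for w in topic_words:
            f_ij = c[w]
            if f_ij == 0:
                continue                           # word absent from this doc
            p_w_d = f_ij / F
            p_w = corpus_counts[w] / F
            total += p_w_d * math.log(p_w_d / (p_w * p_d))
    return total

docs = ["the planet earth", "the chicken crosses the road"]
print(pwi(docs, ["planet", "the"]))
```

The Counter lookups make the inner loop O(1) per word, so the whole thing is linear in corpus size plus (documents × topic words).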
If P_wi is zero, we get an error:
ZeroDivisionError Traceback (most recent call last)
ZeroDivisionError: float division by zero
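A possible guard (my own sketch, not an official fix): treat a zero marginal probability as "no information" and skip the word before the division happens. The name safe_pmi and the toy counts below are hypothetical:

```python
import math

def safe_pmi(p_w_d, p_w, p_d):
    # If either probability is zero, the PMI is undefined: the division
    # raises ZeroDivisionError and log(0) diverges, so skip instead.
    if p_w == 0 or p_w_d == 0:
        return None
    return math.log(p_w_d / (p_w * p_d))

counts = {"earth": 2, "planet": 3}   # toy corpus frequencies
total = 10
p_w = counts.get("moon", 0) / total  # "moon" never occurs -> 0.0
print(safe_pmi(0.05, p_w, 0.2))      # None instead of ZeroDivisionError
```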
+1
This has been brought up many times, I will eventually have a Topic Information Gain Implementation however it is not priority at the moment. I have a version that is somewhat efficient but it will need some work to scale to really large datasets.
Kindly share your implementation of topic information gain; without it we cannot compare Top2Vec with other models. I will be very thankful to you for sharing it. [email protected]
In agreement with Yasir, I would prefer to have a slow implementation that allows for verification of the metric and comparison across different models. Would this be possible?
Hello! I also need the code of topic information for comparison with other methods. If you could share it would be of great value to compare and validate my code.
Would you mind publishing it in a test branch? That way you would at least get some feedback early.
It would be really great if you could share the topic information gain implementation. This would enable comparison between models.
Hey, great work on the Top2Vec model, we are really excited to use it!
I have implemented the PWI measure in order to reproduce and compare Top2Vec and LDA on different collections. I followed Equation 5 from the paper here.
For sanity checking, I have used the same 20newsgroups dataset and the model from this repository. However, I cannot reproduce the values from Figure 6. For instance, when I use 20 topics and 20 words, I get PWI ~1300 using log10 and 2600 using log base 2. Also, when I run a vanilla LDA using the gensim implementation (without pre-processing), I get a PWI around 5783, also for 20 topics and 20 words.
Could you please have a look at my implementation below and see if something is wrong with the metric implementation? If there is nothing wrong with the implementation, is there anything you could think of that could explain this behaviour?
Thank you in advance! :)
Thank you for working on this implementation. When estimating P(w), should we estimate it as the frequency of w divided by the total frequency of all words, or as the number of documents w appears in divided by the total number of documents?
I've seen the latter implementation when estimating PMI (see here). That approach is agnostic to the number of times w appears in a document and only considers how many documents it appears in at least once.
Here is my implementation of getting a coherence score. I used it on a dataset of around 100,000 very long texts, and it took about 8 minutes to run. This function assumes you have a dataframe with a column dedicated to the text of interest; rename that column to ProcessedTexts. Try to use preprocessed texts (I used spaCy to preprocess mine). Make sure to use these imports: import gensim.corpora as corpora; from gensim.utils import tokenize; from gensim.models import CoherenceModel.
import gensim.corpora as corpora
from gensim.utils import tokenize
from gensim.models import CoherenceModel

def t2vCoherence(df, topic_words):
    tokenized = [list(tokenize(doc)) for doc in df.ProcessedTexts.tolist()]
    id2word = corpora.Dictionary(tokenized)
    corpus = [id2word.doc2bow(text) for text in tokenized]
    # Make sure you grab the topic words from the topic model and convert them to a list
    coherence_model = CoherenceModel(topics=topic_words, texts=tokenized,
                                     corpus=corpus, dictionary=id2word,
                                     coherence='c_v', topn=50)  # top 50 words, since Top2Vec returns the top 50 words
    coherence = coherence_model.get_coherence()
    print("Model Coherence C_V is: {0}".format(coherence))
    return coherence
Thanks for this. Does this implement a coherence score for one topic?
There was a small mistake, I believe: you use newgroups.data instead of docs inside of PWI.
See here: https://gist.github.com/behrica/91b3f958fad80247069ade3b96646dcf
So the method I used just finds the coherence of the entire topic model, but you can implement it in a way that finds the coherence of a single topic. The link here is the source code from gensim. To get the coherence of one topic, you would call the segment method on the coherence model and then call the get_coherence_per_topic() method on it. This may also be useful. I hope this helps!
Do you mean by "texts of interest" the input texts to the model?
And what is "topic_words"?
Can we implement your code by "only" taking a trained model as input? A trained model has the fields ".documents" and ".topicWords". Are these the two required inputs to t2vCoherence?
Topic words can be calculated by calling: words, scores = model._find_topic_words_and_scores(model.topic_vectors). That function returns the topic words first and the word scores second. You'd convert the words into a list, which is what this implementation requires. The .documents field is not required, since I pass the documents in directly via the dataframe parameter; the function assumes your processed texts are stored in a column of a dataframe. Hope this helps.
Yes, I have them. I just wanted to be sure that the documents you call "processed texts stored in a column in a dataframe" are the same as the ones I pass to Top2Vec for training.
Or, to formulate it differently: it surprises me that you don't calculate the score from "the model itself" (i.e. the output of the Top2Vec training call). It should contain all the information needed.
Yeah it was kind of a workaround. It's not using the model directly, but it is using the top 50 words for each topic, and those come from the model's produced topic vectors. It's kind of a hacky way to do it, but it works. I may work on some other implementations here soon. I would definitely be interested to see other people's implementations
@lcschv Thank you for your implementation of PWI. I have a question about your calculation of p(d|w). In your code you use dict_docs_freqs[i].freq(word) for this, which is, in my mind, the frequency of w_i within d_i. Following Aizawa (p. 52), this is f_ij and not p(d|w). To get p(d|w) we have to divide f_ij by f_wi (Eq. 23, Aizawa). On page 56, Aizawa assumes that f_ij / f_wi is approximately 1/N_i, if the occurrence of a term does not differ much across the documents. Did I overlook something or misunderstand your code?
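A toy check of the distinction (my own example, following Aizawa's notation f_ij, f_wi, N_i):

```python
docs = [
    ["planet", "earth", "planet"],   # d_0: "planet" occurs twice
    ["planet", "mars"],              # d_1: "planet" occurs once
    ["bread", "street"],             # d_2: "planet" absent
]
word = "planet"

f_wi = sum(d.count(word) for d in docs)  # total frequency of w   -> 3
N_i = sum(1 for d in docs if word in d)  # documents containing w -> 2

for j, d in enumerate(docs):
    f_ij = d.count(word)                 # frequency of w in d_j
    if f_ij:
        print(j, f_ij / f_wi, 1 / N_i)   # p(d|w) vs. the 1/N_i approximation
# For d_0, p(d|w) = 2/3 but 1/N_i = 1/2: the approximation only holds when
# the term occurs about equally often in the documents it appears in.
```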