Global word frequency calculation
Hi, I have a question about how the replacement score $S(w)$ is computed.
In the paper, the score is defined as $S(w) = \mathrm{freq}(w)\,\mathrm{IDF}(w)$. In the code, however, it is computed by summing the TF-IDF score of the term over every document, as shown below. This does not seem to match the formula: $\mathrm{freq}(w)$ over the corpus is not the same as the sum of the length-normalized word frequencies in each document. The IDF of a term, on the other hand, is the same in every document, since both the number of documents containing $w$ and the total number of documents are fixed.
# Compute TF-IDF
tf_idf = {}
for i in range(len(examples)):
    cur_word_dict = {}
    cur_sent = copy.deepcopy(examples[i].word_list_a)
    if examples[i].text_b:
        cur_sent += examples[i].word_list_b
    for word in cur_sent:
        if word not in tf_idf:
            tf_idf[word] = 0
        tf_idf[word] += 1. / len(cur_sent) * idf[word]
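
To make the difference concrete, here is a minimal sketch that compares the two computations on a toy corpus. It rests on assumptions of mine: I read $\mathrm{freq}(w)$ as the raw count of $w$ over the whole corpus, I take $\mathrm{IDF}(w) = \log(N / \mathrm{df}(w))$, and the function names `idf_table`, `score_paper`, and `score_code` are hypothetical helpers, not names from the repository. Since the IDF is constant per term, the quoted loop effectively computes $\mathrm{IDF}(w) \sum_d \mathrm{count}(w, d) / |d|$.

```python
import math
from collections import Counter


def idf_table(docs):
    """IDF(w) = log(N / df(w)). The exact IDF form is an assumption."""
    n_docs = len(docs)
    doc_freq = Counter(w for doc in docs for w in set(doc))
    return {w: math.log(n_docs / df) for w, df in doc_freq.items()}


def score_paper(docs, idf):
    """S(w) = freq(w) * IDF(w), reading freq(w) as the raw corpus count."""
    freq = Counter(w for doc in docs for w in doc)
    return {w: freq[w] * idf[w] for w in freq}


def score_code(docs, idf):
    """Mirrors the quoted loop: adds (1 / len(doc)) * IDF(w) per occurrence."""
    scores = {}
    for doc in docs:
        for w in doc:
            scores[w] = scores.get(w, 0.0) + 1.0 / len(doc) * idf[w]
    return scores


# Toy corpus: "x" occurs once in a short document,
# "y" occurs three times in a long document.
docs = [["x", "z"], ["y", "y", "y"] + ["z"] * 9]
idf = idf_table(docs)
paper = score_paper(docs, idf)
code = score_code(docs, idf)

for w in ("x", "y"):
    print(w, "paper:", paper[w], "code:", code[w])
```

On this corpus the two scores rank `x` and `y` in opposite order: the corpus-count version favors `y`, while the summed per-document version favors `x`, because dividing by the document length penalizes words that occur in long documents.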