Global word frequency calculation

Open ClaudiaShu opened this issue 1 year ago • 0 comments

Hi, I have a question about computing the replacement S score.

In your paper, the score is defined as $S(w) = freq(w)IDF(w)$. In the code, however, it is computed by summing the TF-IDF score of a term over every document, as shown below. But $freq(w)$ over the corpus is not the sum of the per-document word frequencies. Moreover, the IDF of a term should be constant across the corpus, since both the number of documents containing $w$ and the total number of documents are fixed.

```python
# Compute TF-IDF
tf_idf = {}
for i in range(len(examples)):
  cur_word_dict = {}
  cur_sent = copy.deepcopy(examples[i].word_list_a)
  if examples[i].text_b:
    cur_sent += examples[i].word_list_b
  for word in cur_sent:
    if word not in tf_idf:
      tf_idf[word] = 0
    tf_idf[word] += 1. / len(cur_sent) * idf[word]
```
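
To make the difference concrete, here is a minimal sketch on a toy corpus. It is not the repository's code: the helper names (`idf_scores`, `summed_tf_idf`, `corpus_freq_idf`) and the toy documents are made up for illustration. I am assuming that $freq(w)$ means the count of $w$ over the whole corpus divided by the total number of tokens, and that IDF is $\log(N / df(w))$; the paper may define these differently.

```python
import math
from collections import Counter

def idf_scores(docs):
    # Assumed IDF definition: log(N / number of documents containing w).
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return {w: math.log(n_docs / df) for w, df in doc_freq.items()}

def summed_tf_idf(docs, idf):
    # Mirrors the loop quoted above: add (1 / len(doc)) * idf[w] per occurrence,
    # i.e. idf[w] times the sum of per-document relative frequencies.
    scores = {}
    for doc in docs:
        for word in doc:
            scores[word] = scores.get(word, 0.0) + 1.0 / len(doc) * idf[word]
    return scores

def corpus_freq_idf(docs, idf):
    # Assumed reading of S(w) = freq(w) * IDF(w) with a corpus-level frequency.
    counts = Counter(word for doc in docs for word in doc)
    total = sum(counts.values())
    return {w: counts[w] / total * idf[w] for w in counts}

# Hypothetical toy corpus: two documents of different lengths.
docs = [["x", "y"], ["x", "z", "z", "z"]]
idf = idf_scores(docs)
summed = summed_tf_idf(docs, idf)
corpus = corpus_freq_idf(docs, idf)

print("summed z/y:", summed["z"] / summed["y"])
print("corpus z/y:", corpus["z"] / corpus["y"])
```

With these assumptions, `z` scores about 1.5 times `y` under the summed version but about 3 times `y` under the corpus-level version, because the summed version normalizes each occurrence by the length of its own document.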

ClaudiaShu · May 22 '23 14:05