Word frequency calculation is wrong

Open BALaka-18 opened this issue 5 years ago • 0 comments

Word frequencies are computed by the following method:

def _build_frequency_dist(self, phrase_list):
    """Builds frequency distribution of the words in the given body of text.
    :param phrase_list: List of List of strings where each sublist is a
                        collection of words which form a contender phrase.
    """
    self.frequency_dist = Counter(chain.from_iterable(phrase_list))

Tracing back to where phrase_list is built:

def _generate_phrases(self, sentences):
    """Method to generate contender phrases given the sentences of the text
    document.
    :param sentences: List of strings where each string represents a
                      sentence which forms the text.
    :return: Set of string tuples where each tuple is a collection
             of words forming a contender phrase.
    """
    phrase_list = set()
    # Create contender phrases from sentences.
    for sentence in sentences:
        word_list = [word.lower() for word in wordpunct_tokenize(sentence)]
        phrase_list.update(self._get_phrase_list_from_words(word_list))
    return phrase_list

Here phrase_list is a set, so it holds each contender phrase only once. If a phrase repeats in the text, the repeated occurrences are dropped before _build_frequency_dist counts the words, and the resulting frequencies, as I have confirmed by testing, come out too low.
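
A minimal sketch of the effect, independent of rake-nltk. The phrase tuples below are made-up examples, not output from the library; only Counter and chain.from_iterable are taken from the code quoted above:

```python
from collections import Counter
from itertools import chain

# Hypothetical contender phrases as word tuples; ("deep", "learning") occurs twice.
phrases = [("deep", "learning"), ("neural", "networks"), ("deep", "learning")]

# Counting over a set: the repeated phrase is dropped before words are tallied.
freq_from_set = Counter(chain.from_iterable(set(phrases)))

# Counting over a list: every occurrence of every phrase contributes.
freq_from_list = Counter(chain.from_iterable(phrases))

print(freq_from_set["deep"])   # 1
print(freq_from_list["deep"])  # 2
```

Words that occur in several different phrases are still counted once per distinct phrase, so the undercount only affects phrases that repeat verbatim.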

I have modified the Rake() object to ensure the calculations are correct. @csurfer, kindly assign me this issue so I can create a pull request.

BALaka-18, Jul 29 '20 16:07