data-science-your-way

Double-counting the documents containing an item

Open jianle4github opened this issue 6 years ago • 0 comments

If a term, for example "Bourgogne", appears more than once in a given document, such as "Coche-Dury Bourgogne Chardonay 2005, Bourgogne, France", your algorithm appends an entry for that same document to the IrIndex.index list and the IrIndex.tf list once per occurrence. These duplicate entries inflate len(self.tf[term]), which the following code treats as the total number of documents containing the term:

    idf = log( float( len(self.documents) ) / float( len(self.tf[term]) ) )
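To make the distortion concrete, here is a minimal sketch on a toy corpus. Only the first document comes from the example above; the two placeholder documents, the `tokenize` helper (which drops commas before splitting), and the variable names are my own assumptions, not code from the repository.

```python
from math import log

# Toy corpus: the first entry is the document quoted in this issue,
# the other two are made-up placeholders.
documents = [
    "Coche-Dury Bourgogne Chardonay 2005, Bourgogne, France",
    "Placeholder Rioja 2010, Spain",
    "Placeholder Chablis 2012, France",
]


def tokenize(document):
    # Assumed tokenizer: drop commas, then split on whitespace.
    return document.replace(",", " ").split()


term = "Bourgogne"

# One posting per occurrence, which is what appending inside the loop produces.
postings_per_occurrence = [
    pos for pos, doc in enumerate(documents) for t in tokenize(doc) if t == term
]

# One posting per document, which is what the idf denominator should count.
postings_per_document = [
    pos for pos, doc in enumerate(documents) if term in tokenize(doc)
]

idf_distorted = log(float(len(documents)) / float(len(postings_per_occurrence)))
idf_expected = log(float(len(documents)) / float(len(postings_per_document)))

print(postings_per_occurrence, idf_distorted)
print(postings_per_document, idf_expected)
```

With this corpus the term is counted as occurring in two documents instead of one, so the idf comes out as log(3/2) rather than log(3).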

I changed the code from:

    for term in terms:
        if term not in self.index:
            self.index[term] = []
            self.tf[term] = []

        self.index[term].append(document_pos)
        self.tf[term].append(terms.count(term))

to:

    for term in terms:
        if term not in self.index:
            self.index[term] = []
            self.tf[term] = []

        if document_pos not in self.index[term]:
            self.index[term].append(document_pos)
            self.tf[term].append(terms.count(term))

This skips the subsequent append operations whenever the term has already been recorded for the current document inside the IrIndex object.
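For anyone who wants to try the change in isolation, here is a minimal, self-contained sketch. `TinyIndex` is a hypothetical stand-in for IrIndex that keeps only the attributes mentioned in this issue; the method names, the comma-dropping tokenizer, and the second document are assumptions of mine rather than the repository's code.

```python
from math import log


class TinyIndex:
    """Hypothetical stand-in for IrIndex, reduced to the parts discussed here."""

    def __init__(self):
        self.documents = []
        self.index = {}  # term -> positions of the documents containing it
        self.tf = {}     # term -> term frequency in each of those documents

    def index_document(self, document):
        # Assumed tokenizer: drop commas, then split on whitespace.
        terms = document.replace(",", " ").split()
        self.documents.append(document)
        document_pos = len(self.documents) - 1

        for term in terms:
            if term not in self.index:
                self.index[term] = []
                self.tf[term] = []

            # Record each (term, document) pair only once.
            if document_pos not in self.index[term]:
                self.index[term].append(document_pos)
                self.tf[term].append(terms.count(term))

    def idf(self, term):
        return log(float(len(self.documents)) / float(len(self.tf[term])))


idx = TinyIndex()
idx.index_document("Coche-Dury Bourgogne Chardonay 2005, Bourgogne, France")
idx.index_document("Placeholder Rioja 2010, Spain")  # made-up second document

print(idx.index["Bourgogne"], idx.tf["Bourgogne"], idx.idf("Bourgogne"))
```

The membership test scans the whole postings list for the term. Since document_pos always refers to the newest document, comparing against only the last entry, or looping over set(terms), would probably give the same result with less work, but I have not checked that against the rest of the class.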

jianle4github, Dec 16 '17 22:12