
DocumentFrequencyTable appears to have mildly incorrect results

Open danluu opened this issue 7 years ago • 0 comments

This uses the data from chunk1 of http://bitfunnel.org/wikipedia-as-test-corpus-for-bitfunnel/ and the code as of a740476e7955452c2dd6396367c2e8fc341e995d. Statistics and a termtable were regenerated, and the config files from the post weren't used.

When we run verify log on terms with frequency 0.00102168 (I believe that means each term should appear exactly 18 times), terms appear between 0 and 23 times. The really low numbers could come from a known data cleaning bug, and there aren't many of those outliers, but if we exclude them, we still get terms that appear between 14 and 23 times.
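The arithmetic behind the "exactly 18 times" reading can be sketched as follows, assuming the frequency is the fraction of documents that contain the term. The document count used here is not taken from the corpus; it is back-derived from 18 / 0.00102168 purely for illustration, and both function names are hypothetical.

```python
def implied_corpus_size(frequency: float, occurrences: int) -> int:
    """Back-derive a document count from a frequency and an occurrence count."""
    return round(occurrences / frequency)


def expected_document_count(frequency: float, total_documents: int) -> int:
    """Number of documents a term should appear in, given its frequency."""
    return round(frequency * total_documents)


FREQUENCY = 0.00102168

# Assumption: the corpus size is inferred, not read from the real statistics.
ASSUMED_DOCUMENTS = implied_corpus_size(FREQUENCY, 18)

print(ASSUMED_DOCUMENTS)
print(expected_document_count(FREQUENCY, ASSUMED_DOCUMENTS))
```

Under that assumption, every term in this frequency bucket should show up in the same number of documents, so a spread of 14 to 23 would not be explained by rounding alone.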

I'm tracking down another bug that's causing many of these terms to map to all-1s rank 0 rows. I may not fix this bug as a side effect of tracking down and fixing that bug, but I think there's another bug here. If we generate stats from a corpus and then verify the same corpus, the verification results should match the stats.
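The round-trip property I have in mind can be sketched on a toy corpus. This is not BitFunnel code: `document_frequencies` and `count_matching_documents` are hypothetical stand-ins for statistics generation and verification, and the three-document corpus is made up.

```python
from collections import Counter


def document_frequencies(corpus):
    """Stand-in for statistics generation: fraction of documents per term."""
    counts = Counter()
    for document in corpus:
        # Count each term once per document, however often it repeats.
        counts.update(set(document.split()))
    return {term: count / len(corpus) for term, count in counts.items()}


def count_matching_documents(corpus, term):
    """Stand-in for verification: documents that actually contain the term."""
    return sum(1 for document in corpus if term in document.split())


corpus = ["a b c", "a b", "a a d"]
stats = document_frequencies(corpus)

# Terms whose verified count disagrees with the count implied by the stats.
mismatches = {
    term: (round(frequency * len(corpus)), count_matching_documents(corpus, term))
    for term, frequency in stats.items()
    if round(frequency * len(corpus)) != count_matching_documents(corpus, term)
}
print(mismatches)
```

If stats and verification run over the same documents, `mismatches` should be empty; any entry means the two passes disagree about which documents contain a term.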

danluu, Oct 28 '16 23:10