python-wordsegment Correctly merge lowercase and uppercase bigrams

Open kvakil opened this issue 6 years ago • 0 comments

Some entries in the wordsegment/bigrams.txt file were duplicated. Each bigram had been lowercased, but a bigram that originally appeared in both an uppercase and a lowercase form ended up as two separate lowercase entries. The code uses only one of these entries, so the frequency of these bigrams was underestimated.
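To illustrate the underestimation, here is a minimal sketch. It assumes the file holds tab-separated "bigram<TAB>count" lines and that the loader builds a plain dict from them, so the last duplicate wins. The counts are made up for the example and are not taken from bigrams.txt.

```python
# Hypothetical lines; the real counts in bigrams.txt are not shown in this issue.
lines = [
    "hello world\t100.0",  # was already lowercase
    "hello world\t40.0",   # e.g. "Hello World" after lowercasing
]

# Assumed loader behaviour: a dict built from (key, value) pairs
# keeps only the last value seen for a repeated key.
pairs = (line.split("\t") for line in lines)
naive = {key: float(count) for key, count in pairs}

# Summing the duplicates keeps the full frequency instead.
merged = {}
for line in lines:
    key, count = line.split("\t")
    merged[key] = merged.get(key, 0.0) + float(count)

print(naive["hello world"], merged["hello world"])
```

Under these assumptions the naive dict reports only one of the two counts, while the merged dict reports their sum.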

The attached program lowercase_ngrams.py lowercases its input while merging the frequencies correctly. The wordsegment/bigrams.txt file is updated using this program. The wordsegment/unigrams.txt file did not have this issue, so it was not changed.
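The merge step can be sketched roughly as follows. This is not the attached lowercase_ngrams.py, whose contents are not shown here; the function name merge_lowercased, the tab-separated input format, and the integer counts are all assumptions made for illustration.

```python
from collections import defaultdict


def merge_lowercased(lines):
    """Lowercase each n-gram and sum the counts of entries that collide.

    Assumes each line looks like "<ngram>\t<integer count>".
    """
    totals = defaultdict(int)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab so the n-gram itself may contain spaces.
        ngram, count = line.rsplit("\t", 1)
        totals[ngram.lower()] += int(count)
    return dict(totals)


# Hypothetical input, not real data from the corpus.
sample = ["Hello World\t40\n", "hello world\t100\n", "of the\t7\n"]
result = merge_lowercased(sample)
print(result)
```

Summing rather than overwriting is the point of the sketch: every case variant contributes to the single lowercase entry.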

A new test was added to tests/test_coverage.py showing that "helloworld" is now correctly segmented as "hello world". Previous versions left it unsegmented as "helloworld" because the frequency of the bigram was underestimated.

Oct 11 '19 19:10 kvakil