python-wordsegment
Correctly merge lowercase and uppercase bigrams
Some entries in the wordsegment/bigrams.txt file were duplicated. Every bigram in the file had been lowercased, but a bigram that originally appeared in both an uppercase and a lowercase form ended up as two separate lowercase entries. The code only uses one of these entries, so the frequency of such bigrams was underestimated.
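As an illustration of why a duplicated entry leads to an underestimate, here is a minimal sketch. It assumes a loader that reads tab-separated "ngram<TAB>count" lines into a dict; the function name load_naive, the line format, and the counts are hypothetical and are not taken from the wordsegment source.

```python
# Hypothetical sample: the same lowercased bigram appears twice,
# once from an originally capitalized form and once from a lowercase form.
lines = [
    "hello world\t100",
    "hello world\t250",
]


def load_naive(lines):
    """Load counts into a dict; a repeated key overwrites the earlier value."""
    counts = {}
    for line in lines:
        ngram, count = line.rsplit("\t", 1)
        counts[ngram] = float(count)
    return counts


naive = load_naive(lines)
true_total = 100.0 + 250.0

# Only one of the two entries survives, so the stored count is below the total.
print(naive["hello world"], "stored versus", true_total, "observed")
```

With a dict-based loader of this kind, whichever duplicate is read last wins, and the other count is silently dropped.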
The attached program lowercase_ngrams.py lowercases its input while merging the frequencies correctly. The wordsegment/bigrams.txt file was updated using this program. The wordsegment/unigrams.txt file did not have this issue, so it was not changed.
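The following is a rough sketch of the kind of merge described here, not the contents of lowercase_ngrams.py itself. It assumes tab-separated "ngram<TAB>count" lines with integer counts and assumes that merging means summing the counts of entries that become identical after lowercasing; the function names and sample data are hypothetical.

```python
from collections import Counter


def lowercase_and_merge(lines):
    """Lowercase each n-gram and sum the counts of entries that collide."""
    totals = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        ngram, count = line.rsplit("\t", 1)
        totals[ngram.lower()] += int(count)
    return totals


def format_lines(totals):
    """Render merged counts back into tab-separated lines, sorted by n-gram."""
    return [f"{ngram}\t{count}" for ngram, count in sorted(totals.items())]


# Hypothetical sample input with a mixed-case duplicate.
sample = ["Hello World\t100", "hello world\t250", "New York\t40"]
merged = lowercase_and_merge(sample)

for out_line in format_lines(merged):
    print(out_line)
```

Summing into a Counter keyed on the lowercased n-gram guarantees that each n-gram appears exactly once in the output, so a loader can no longer pick up only part of its frequency.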
A new test was added to tests/test_coverage.py, showing that "helloworld" is now correctly segmented as "hello world". Earlier versions left it unsegmented as "helloworld" because the frequency of the bigram was underestimated.