python-wordsegment
English word segmentation, written in pure Python and based on a trillion-word corpus.
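To illustrate what count-based word segmentation means, here is a minimal, self-contained sketch. It is not the library's implementation: `TOY_COUNTS`, `score`, and `segment_sketch` are hypothetical names, the counts are made up for the example, and the unknown-word penalty is an assumption. The real package draws its counts from the trillion-word corpus mentioned above.

```python
import math
from functools import lru_cache

# Hypothetical toy counts, invented for this example only.
TOY_COUNTS = {"this": 50, "is": 80, "a": 100, "test": 20,
              "his": 10, "at": 30, "est": 1}
TOTAL = sum(TOY_COUNTS.values())
MAX_WORD_LEN = 24  # assumed cap on candidate word length


def score(word):
    """Log10 probability of a word under the toy unigram counts."""
    if word in TOY_COUNTS:
        return math.log10(TOY_COUNTS[word] / TOTAL)
    # Assumed penalty: unknown words get less likely as they get longer.
    return math.log10(10.0 / (TOTAL * 10 ** len(word)))


def segment_sketch(text):
    """Return the highest-scoring split of text into words."""

    @lru_cache(maxsize=None)
    def best(start):
        # Best (score, words) pair for the suffix text[start:].
        if start == len(text):
            return (0.0, ())
        candidates = []
        for end in range(start + 1, min(len(text), start + MAX_WORD_LEN) + 1):
            word = text[start:end]
            rest_score, rest_words = best(end)
            candidates.append((score(word) + rest_score, (word,) + rest_words))
        return max(candidates)

    return list(best(0)[1])


print(segment_sketch("thisisatest"))
```

The sketch scores each candidate split by summing per-word log probabilities and memoizes the best result for each suffix. It recurses once per character, so it only suits short inputs.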
I’m considering using this as a service in an app for the disabled people we support, who would use it to communicate. We see a lot of users who...
The LDC has published the Web 1T 5-gram, 10 European Languages corpus (https://catalog.ldc.upenn.edu/LDC2009T25). Is there any plan to support these languages? If not, can I jump in and contribute? Would...
Bumps [wheel](https://github.com/pypa/wheel) from 0.29.0 to 0.38.1. Changelog (sourced from wheel's changelog): Release Notes. UNRELEASED: Updated vendored packaging to 22.0. 0.38.4 (2022-11-09): Fixed PKG-INFO conversion in bdist_wheel mangling UTF-8 header values...
Added `pyproject.toml`. Replaced importing the version variable with reading it from the file using `read_version`. If we drop `python3 ./setup.py test`, then `setup.py` can be removed completely since now (to...
Hi, I'm having trouble with following code: ``` import wordsegment wordsegment.load() text = "The article went on to say, “For in the pizza shops rich and poor harmoniously congregate; they...
Allows for customization in #33.
### 1. Summary It would be nice if WordSegment, at least in CLI mode, had an option to preserve all punctuation marks: `.`, `,`, `’` and so on. ###...
- commit 1: test coverage for maintaining original character casing - commit 2: optional cmd line arg for maintaining case in file input (defaults to original, lower cased segment output)...
"a frail 88-year old man" is being output as ["a","frail88","year","old"]. This doesn't help at all. Having numbers in a block of text is so common in any domain. It's sad...
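One possible workaround for the behaviour reported above is to separate digit runs from letter runs before segmenting, so that a number is never glued onto a neighbouring word. The sketch below is not part of the library: `split_runs` and `segment_keeping_numbers` are hypothetical helpers, and `segment_fn` stands in for any segmenter that takes a lowercase alphabetic string and returns a list of words.

```python
import re


def split_runs(text):
    """Lowercase text, drop non-alphanumerics, and split into letter and digit runs."""
    cleaned = re.sub(r"[^a-z0-9]", "", text.lower())
    return re.findall(r"[a-z]+|[0-9]+", cleaned)


def segment_keeping_numbers(text, segment_fn):
    """Segment letter runs with segment_fn; pass digit runs through unchanged."""
    words = []
    for run in split_runs(text):
        if run.isdigit():
            words.append(run)  # keep the number as its own token
        else:
            words.extend(segment_fn(run))
    return words


# Stub segmenter for demonstration: returns each letter run unsplit.
print(segment_keeping_numbers("a frail 88-year old man", lambda run: [run]))
```

With a real segmenter passed as `segment_fn`, the letter runs on either side of `88` would be split into words independently. The trade-off is that the segmenter no longer sees context across the number.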
Some entries in the wordsegment/bigrams.txt file used to be duplicated. In particular, each bigram was lowercased, but since some bigrams appeared in both uppercase and lowercase, the same bigram appeared...
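The kind of duplication described above can be collapsed by lowercasing each key and merging the counts that collide. The sketch below assumes one entry per line in the form `bigram<TAB>count`, and it assumes that summing colliding counts is the desired policy. Neither is confirmed by the text above, and `merge_duplicate_counts` is a hypothetical helper.

```python
from collections import Counter


def merge_duplicate_counts(lines):
    """Lowercase each key and sum the counts of entries that collide."""
    merged = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key, count = line.rsplit("\t", 1)  # assumed format: key<TAB>count
        merged[key.lower()] += int(count)
    return dict(merged)


sample = ["of the\t10", "Of The\t5", "new york\t3"]
print(merge_duplicate_counts(sample))
```

Comparing the number of input lines with the number of merged keys gives a quick check for whether a file contains such duplicates.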