ICU-20780 Update thaidict.txt with more words
Updating the Thai dictionary with more words. The words are taken from the original thaidict.txt file, combined with words from the following codebase: https://github.com/PyThaiNLP/pythainlp/blob/2.0/pythainlp/corpus/words_th.txt. Words that were commented out in the previous version were kept.
Checklist
- [x] Issue filed: https://unicode-org.atlassian.net/browse/ICU-20780
- [x] Updated PR title and link in previous line to include Issue number
- [x] Issue accepted
- [x] Tests included
- [x] Documentation is changed or added
License info from https://github.com/PyThaiNLP/pythainlp/blob/dev/README.md
--> https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/corpus/corpus_license.md
Will CC-BY make it compatible with ICU terms?
As a contributor to PyThaiNLP: we (PyThaiNLP) can reconsider the license of the dataset to make it more usable for the wider community.
We (PyThaiNLP) have now changed the license of the dataset to CC0. https://github.com/PyThaiNLP/pythainlp/releases/tag/v2.2.2
Any updates on this? I'm using something that uses ICU to segment words but it seems to perform very poorly due to the limited dictionary.
@nickt1512 you may want to update the word list from PyThaiNLP.
We recently found lots of misspellings in the dictionary; some are documented here: https://github.com/PyThaiNLP/pythainlp/issues/557 .
An updated dictionary with corrections is already in the PyThaiNLP dev branch.
@artt you can also have your own custom dictionary.
icu4c provides gendict, a command-line tool that converts a text file containing a word list into the ICU dictionary format (a trie structure).
- usage: https://manpages.debian.org/unstable/icu-devtools/gendict.1.en.html
- source code: https://github.com/unicode-org/icu/tree/main/icu4c/source/tools/gendict
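
As a rough illustration of building a custom dictionary, the sketch below assembles a gendict invocation from Python. The `--bytes` and `--transform offset-<hex>` options are taken from the man page linked above. The offset `0x0e00` (the start of the Thai Unicode block), the file names, and the helper names `gendict_command` and `build_dictionary` are assumptions made for this example, and depending on the installation gendict may need further options such as `-i` to locate ICU data.

```python
import subprocess


def gendict_command(word_list, output, gendict="gendict", offset="0x0e00"):
    """Build the argument list for a gendict run in bytes-trie mode.

    The offset default is an assumption chosen for Thai text; adjust it
    for other scripts.
    """
    return [
        gendict,
        "--bytes",                         # store the trie as bytes
        "--transform", f"offset-{offset}", # map code points relative to offset
        word_list,                         # input: one word per line
        output,                            # output: compiled .dict file
    ]


def build_dictionary(word_list, output):
    """Run gendict; raises CalledProcessError if the build fails."""
    subprocess.run(gendict_command(word_list, output), check=True)
```

For example, `build_dictionary("my_words.txt", "my_thai.dict")` would attempt the build, assuming gendict is on the PATH.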
@nickt1512 @bact
I looked into the build error, and it seems that ICU doesn't like words that contain spaces. After removing 611 words (mostly ones with spaces for ไม้ยมก, and people's names), the build was successful. Attached in case you'd like to update the PR. Thanks! thaidict_no_space.txt
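
The filtering step described above could be sketched roughly as follows. This is not the script that produced thaidict_no_space.txt; the function name `split_word_list` is hypothetical, and the sketch assumes that comment lines start with `#` and that any whitespace inside an entry is what breaks the build.

```python
def split_word_list(lines):
    """Separate entries into buildable ones and ones containing whitespace.

    Returns (kept, rejected). Comment and blank lines are passed through
    in `kept` so that commented-out words survive the cleanup.
    """
    kept, rejected = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        entry = line.strip()
        # Assumption: '#' starts a comment line in the word list.
        if not entry or entry.startswith("#"):
            kept.append(line)
            continue
        # Assumption: whitespace inside a word is what the build rejects.
        if any(ch.isspace() for ch in entry):
            rejected.append(entry)
        else:
            kept.append(entry)
    return kept, rejected
```

Keeping the rejected entries in a separate list makes it easy to review what was dropped (for example, the ไม้ยมก cases) before updating the PR.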
40000+ additions seem like a lot, on top of just 26000 words now. This seems excessive. Did someone measure the line-breaking quality? By how much does it improve with this larger dictionary? How much could it improve with a smaller addition?
Since the dictionary from PyThaiNLP also tends to include misspelled words (intentionally), I guess any line breaker that uses the dictionary could be more robust when it encounters those kinds of words. But there is no comparison yet between the current ICU dictionary and the PyThaiNLP one in terms of quality.
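
One possible way to make such a comparison, sketched under assumptions: segment the same reference text with each dictionary and score the predicted break positions against a hand-segmented gold standard using boundary precision, recall, and F1. The function names below are hypothetical, and no measurements have been made with this code.

```python
def boundaries(tokens):
    """Return the set of character offsets at which a token ends."""
    offsets, pos = set(), 0
    for tok in tokens:
        pos += len(tok)
        offsets.add(pos)
    return offsets


def boundary_scores(gold_tokens, predicted_tokens):
    """Compute (precision, recall, F1) over internal break positions."""
    if "".join(gold_tokens) != "".join(predicted_tokens):
        raise ValueError("segmentations cover different text")
    gold = boundaries(gold_tokens)
    pred = boundaries(predicted_tokens)
    # The break at the end of the text is shared trivially, so ignore it.
    end = sum(len(tok) for tok in gold_tokens)
    gold.discard(end)
    pred.discard(end)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 1.0
    recall = true_pos / len(gold) if gold else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Running this once per dictionary on the same gold corpus would give numbers to compare, and repeating it with smaller subsets of the added words could address the question of how much a smaller addition would help.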