ICU-20780 Update thaidict.txt with more words

Open nickt1512 opened this issue 4 years ago • 10 comments

Updating the Thai dictionary with more words. The words are taken from the original thaidict.txt file, combined with words from the following codebase: https://github.com/PyThaiNLP/pythainlp/blob/2.0/pythainlp/corpus/words_th.txt. Words that were commented out in the previous version were kept.

Checklist

[x] Issue filed: https://unicode-org.atlassian.net/browse/ICU-20780
[x] Updated PR title and link in previous line to include Issue number
[x] Issue accepted
[x] Tests included
[x] Documentation is changed or added

Aug 16 '19 22:08 nickt1512

All committers have signed the CLA.

Aug 16 '19 22:08 CLAassistant

License info from https://github.com/PyThaiNLP/pythainlp/blob/dev/README.md

--> https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/corpus/corpus_license.md

Sep 11 '19 18:09 hagbard

Will CC-BY make it compatible with ICU terms?

Speaking as a PyThaiNLP contributor: we (PyThaiNLP) can reconsider the license of the dataset to make it more usable for the wider community.

Jun 26 '20 09:06 bact

We (PyThaiNLP) have now changed the license of the dataset to CC0. https://github.com/PyThaiNLP/pythainlp/releases/tag/v2.2.2

Jul 11 '20 04:07 wannaphong

Any updates on this? I'm using something that uses ICU to segment words but it seems to perform very poorly due to the limited dictionary.

Jun 08 '21 10:06 artt

@nickt1512 you may want to update the word list from PyThaiNLP.

We recently found many misspellings in the dictionary, some of which are documented here: https://github.com/PyThaiNLP/pythainlp/issues/557

An updated dictionary with corrections is already in the PyThaiNLP dev branch.

Jun 08 '21 20:06 bact

@artt you can also have your own custom dictionary.

icu4c provides the gendict command-line tool, which converts a text file containing a word list into the ICU dictionary format (a trie structure).

usage: https://manpages.debian.org/unstable/icu-devtools/gendict.1.en.html
source code: https://github.com/unicode-org/icu/tree/main/icu4c/source/tools/gendict
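
As a rough illustration of the gendict step, here is a minimal Python sketch that assembles the command line and runs it only if gendict is installed. The file names are hypothetical, and the `--bytes --transform offset-0x0e00` options are an assumption based on what ICU's own data build appears to use for thaidict; please check the man page above before relying on them.

```python
import shutil
import subprocess


def gendict_command(word_list, output, transform="offset-0x0e00"):
    """Build the gendict argument list for a bytes-trie dictionary.

    The default transform is an assumption (offset of the Thai block);
    see the gendict man page for the supported values.
    """
    return ["gendict", "--bytes", "--transform", transform, word_list, output]


def build_dictionary(word_list, output):
    """Run gendict if it is on PATH; return False if it is not installed."""
    if shutil.which("gendict") is None:
        return False
    subprocess.run(gendict_command(word_list, output), check=True)
    return True


# Hypothetical file names, for illustration only.
print(" ".join(gendict_command("my_thaidict.txt", "my_thaidict.dict")))
```

How the resulting .dict file gets loaded depends on the application and on how its ICU data is packaged, so that part is left out here.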

Jun 08 '21 20:06 bact

@nickt1512 @bact

I looked into the build error, and it seems that ICU doesn't like words with spaces in them. After removing 611 words (mostly entries with a space before ไม้ยมก, and people's names), the build was successful. I'm attaching the result in case you'd like to update the PR. Thanks! thaidict_no_space.txt
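
The filtering described above can be sketched roughly as follows. This is only an illustration, not the script actually used: it assumes a word list with one entry per line, no frequency values after the word, and `#` for comment lines. The function name is hypothetical.

```python
def split_word_list(lines):
    """Separate usable entries from entries that contain whitespace.

    Blank lines and comment lines (starting with '#') are kept unchanged,
    so commented-out words survive the filtering.
    """
    kept, dropped = [], []
    for line in lines:
        entry = line.strip()
        if not entry or entry.startswith("#"):
            kept.append(entry)
        elif any(ch.isspace() for ch in entry):
            # Internal whitespace, e.g. a space before ไม้ยมก or inside a name.
            dropped.append(entry)
        else:
            kept.append(entry)
    return kept, dropped


sample = ["# header", "กิน", "ต่าง ๆ", "สมชาย ใจดี", ""]
kept, dropped = split_word_list(sample)
print(len(kept), "kept,", len(dropped), "dropped")
```

When reading a real file, opening it with `encoding="utf-8-sig"` avoids a leading byte order mark being mistaken for part of the first line.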

Jun 20 '21 04:06 artt

40000+ additions seems like a lot, on top of just 26000 words now. This seems excessive. Did someone measure the line breaking quality? By how much does it improve with this larger dictionary? How much could it improve with a smaller addition?

Oct 27 '22 16:10 markusicu

> 40000+ additions seems like a lot, on top of just 26000 words now. This seems excessive. Did someone measure the line breaking quality? By how much does it improve with this larger dictionary? How much could it improve with a smaller addition?

Since the PyThaiNLP dictionary also tends to include misspelled words (intentionally), I guess any line breaker that uses it can be more robust when it encounters such words. But there is no quality comparison yet between the current ICU dictionary and the PyThaiNLP one.

Mar 09 '23 11:03 bact
