ICU-20780 Update thaidict.txt with more words

Open nickt1512 opened this issue 4 years ago • 10 comments

Updating the Thai dictionary with more words. The words are taken from the original thaidict.txt file combined with words from the following codebase: https://github.com/PyThaiNLP/pythainlp/blob/2.0/pythainlp/corpus/words_th.txt. Words that were commented out in the previous version were kept.

Checklist
  • [x] Issue filed: https://unicode-org.atlassian.net/browse/ICU-20780
  • [x] Updated PR title and link in previous line to include Issue number
  • [x] Issue accepted
  • [x] Tests included
  • [x] Documentation is changed or added

nickt1512 avatar Aug 16 '19 22:08 nickt1512

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Aug 16 '19 22:08 CLAassistant

License info from https://github.com/PyThaiNLP/pythainlp/blob/dev/README.md

--> https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/corpus/corpus_license.md

hagbard avatar Sep 11 '19 18:09 hagbard

Will CC-BY make it compatible with ICU terms?

As a contributor to PyThaiNLP, I can say that we (PyThaiNLP) can reconsider the license of the dataset to make it more usable for the wider community.

bact avatar Jun 26 '20 09:06 bact

We (PyThaiNLP) have now changed the license of the dataset to CC-0. https://github.com/PyThaiNLP/pythainlp/releases/tag/v2.2.2

wannaphong avatar Jul 11 '20 04:07 wannaphong

Any updates on this? I'm using something that uses ICU to segment words but it seems to perform very poorly due to the limited dictionary.

artt avatar Jun 08 '21 10:06 artt

@nickt1512 you may want to update the word list from PyThaiNLP.

We recently found lots of misspellings in the dictionary, some of which are documented here: https://github.com/PyThaiNLP/pythainlp/issues/557 .

An updated dictionary with corrections is already in the PyThaiNLP dev branch.

bact avatar Jun 08 '21 20:06 bact

@artt you can also have your own custom dictionary.

icu4c provides gendict, a command-line tool that converts a text file containing a word list into the ICU dictionary format (a trie structure).

  • usage: https://manpages.debian.org/unstable/icu-devtools/gendict.1.en.html
  • source code: https://github.com/unicode-org/icu/tree/main/icu4c/source/tools/gendict
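
As a rough illustration of the custom-dictionary route, here is a minimal Python sketch that assembles a gendict invocation. The `--uchars` option and the input-file/output-file argument order follow the man page linked above; the helper names and file paths are hypothetical, and ICU's own build may use different options for thaidict.txt, so treat this as a starting point rather than the official recipe.

```python
import subprocess


def gendict_command(word_list_path, output_path, gendict="gendict"):
    """Build the argument list for a gendict run (hypothetical helper).

    Assumes gendict is on PATH and that a UCharsTrie (--uchars) is acceptable;
    check the man page for the options your ICU version expects.
    """
    return [gendict, "--uchars", word_list_path, output_path]


def run_gendict(word_list_path, output_path):
    """Run gendict and raise CalledProcessError if the build fails."""
    subprocess.run(gendict_command(word_list_path, output_path), check=True)


# Example (paths are placeholders, not files from this PR):
# run_gendict("my_thai_words.txt", "my_thai_words.dict")
```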

bact avatar Jun 08 '21 20:06 bact

@nickt1512 @bact

I looked into the build error and it seems ICU doesn't like words with spaces in them. After removing 611 words (mostly words containing spaces for ไม้ยมก, and people's names), the build was successful. In case you'd like to update the PR. Thanks! thaidict_no_space.txt
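
For anyone who wants to repeat this kind of cleanup, a minimal Python sketch follows. It assumes one entry per line and that lines starting with `#` are comments to keep untouched; the function name and sample entries are made up for illustration, and this is not necessarily how the 611 words above were removed.

```python
def split_spaced_entries(lines):
    """Separate dictionary entries that contain whitespace from the rest.

    Returns (kept, dropped): `kept` preserves comments and blank lines,
    `dropped` lists the entries that contain internal whitespace.
    """
    kept, dropped = [], []
    for raw in lines:
        entry = raw.strip()
        # Assumption: '#' starts a comment line; keep it and blank lines as-is.
        if not entry or entry.startswith("#"):
            kept.append(raw.rstrip("\n"))
            continue
        if any(ch.isspace() for ch in entry):
            dropped.append(entry)
        else:
            kept.append(entry)
    return kept, dropped


# Example with made-up entries:
sample = ["# header line\n", "กิน\n", "ต่าง ๆ\n", "\n", "น้ำ\n"]
kept, dropped = split_spaced_entries(sample)
```

Writing `kept` back out one entry per line gives a file without spaced entries, and `dropped` can be reviewed by hand before deciding what to do with those words.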

artt avatar Jun 20 '21 04:06 artt

40000+ additions seems like a lot, on top of just 26000 words now. This seems excessive. Did someone measure the line breaking quality? By how much does it improve with this larger dictionary? How much could it improve with a smaller addition?

markusicu avatar Oct 27 '22 16:10 markusicu

> 40000+ additions seems like a lot, on top of just 26000 words now. This seems excessive. Did someone measure the line breaking quality? By how much does it improve with this larger dictionary? How much could it improve with a smaller addition?

Since the dictionary from PyThaiNLP also tends to include misspelled words (intentionally), I guess any line breaker that uses the dictionary can be more robust when it encounters those kinds of words. But there is no quality comparison yet between the current ICU dictionary and the PyThaiNLP one.
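
One possible way to make such a comparison, sketched in Python: score each segmenter's output against a hand-segmented reference using boundary precision, recall, and F1. The function names and the sample sentence are hypothetical, and this is only one of several reasonable metrics; no such measurement has been done in this thread.

```python
def boundaries(tokens):
    """Return the set of character offsets where an internal break occurs."""
    offsets, pos = set(), 0
    for tok in tokens[:-1]:  # the end of the text is not counted as a break
        pos += len(tok)
        offsets.add(pos)
    return offsets


def boundary_scores(gold_tokens, predicted_tokens):
    """Compute (precision, recall, f1) of predicted breaks against gold breaks."""
    if "".join(gold_tokens) != "".join(predicted_tokens):
        raise ValueError("both segmentations must cover the same text")
    gold, pred = boundaries(gold_tokens), boundaries(predicted_tokens)
    if not gold and not pred:
        return 1.0, 1.0, 1.0  # nothing to find, nothing predicted
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Example: the reference has two breaks, the segmenter found only the first.
scores = boundary_scores(["ฉัน", "กิน", "ข้าว"], ["ฉัน", "กินข้าว"])
```

Running the same test sentences through a break iterator built with each dictionary and comparing the scores would show how much the larger word list actually helps.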

bact avatar Mar 09 '23 11:03 bact