BLKSerene
The `conllu` package should suffice for parsing UD corpora: https://github.com/EmilStenstrom/conllu
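A minimal sketch of what parsing with `conllu` could look like; the file name `en_ewt-ud-dev.conllu` is just an illustrative UD treebank file, not something from this thread:

```python
# Sketch: stream sentences from a CoNLL-U file with the conllu package.
from conllu import parse_incr

with open("en_ewt-ud-dev.conllu", encoding="utf-8") as f:
    for sentence in parse_incr(f):
        # Each sentence is a TokenList; "form" holds the surface token.
        tokens = [token["form"] for token in sentence]
        print(tokens)
```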
@arademaker The problem concerns the word tokenizers, not the Punkt sentence tokenizer. Are they based on the same algorithms?
In my use cases, I do not use `nltk.word_tokenize` (which would call `nltk.sent_tokenize` first). I call `TreebankWordTokenizer` and `NLTKWordTokenizer` directly for the word tokenization task (sentence tokenization, if needed, is handled...
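A rough sketch of the usage pattern I mean, calling the tokenizer classes directly on a single sentence; the example sentence is illustrative only:

```python
# Sketch: word tokenization without going through nltk.word_tokenize /
# nltk.sent_tokenize, using the tokenizer classes directly.
from nltk.tokenize import NLTKWordTokenizer, TreebankWordTokenizer

sent = "Good muffins cost $3.88 in New York."

print(TreebankWordTokenizer().tokenize(sent))
print(NLTKWordTokenizer().tokenize(sent))
```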
@wannaphong There are no spaces between Thai words, only between sentences, am I right? I do not speak Thai, so I can't give examples here. But when Thai words are...
@wannaphong Is `clause_tokenize` required to get the correct detokenized string? If so, perhaps `clause_tokenize` could be called implicitly inside `word_detokenize`?
Since I do not speak Thai, I'm a bit confused about some points. The input of `word_detokenize` could be either a list of tokens (strings) or a list of sub-lists of strings...
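To make the question concrete, here is a sketch of the two call shapes I have in mind, assuming the `pythainlp.tokenize` API and an illustrative Thai sentence (I may be misreading the expected input):

```python
# Sketch: the two input shapes in question for word_detokenize.
from pythainlp.tokenize import word_detokenize, word_tokenize

# Shape 1: a flat list of tokens.
tokens = word_tokenize("ผมไปโรงเรียน")
print(word_detokenize(tokens))

# Shape 2: a list of sub-lists of tokens (e.g. per clause/sentence).
nested = [["ผม", "ไป"], ["โรงเรียน"]]
print(word_detokenize(nested))
```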
And how are English words (and words in other Indo-European languages) handled in `word_detokenize` (are spaces added between Thai and English words or not)?
There is a bug in the macOS version which should be fixed in the next release. You may use the Windows version for now.
Fixed in [2.3.0](https://github.com/BLKSerene/Wordless/releases/tag/2.3.0), please give it a try.
The font scaling issue on high-DPI screens will be gradually addressed in future versions.