BLKSerene
The `conllu` package should suffice for parsing UD corpora: https://github.com/EmilStenstrom/conllu
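A minimal sketch of what parsing with `conllu` could look like; the file name `en_ewt-ud-dev.conllu` is just an illustrative UD treebank file, not something from this thread:

```python
# Sketch: stream sentences from a CoNLL-U file with the conllu package.
from conllu import parse_incr

with open("en_ewt-ud-dev.conllu", encoding="utf-8") as f:
    for sentence in parse_incr(f):
        # Each sentence is a TokenList; "form" holds the surface token.
        tokens = [token["form"] for token in sentence]
        print(tokens)
```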
@arademaker The problem concerns the word tokenizers, not the Punkt sentence tokenizer. Are they based on the same algorithms?
In my use cases, I do not use `nltk.word_tokenize` (which would call `nltk.sent_tokenize` first). I call `TreebankWordTokenizer` and `NLTKWordTokenizer` directly for the word tokenization task (sentence tokenization, if needed, is handled...
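A rough sketch of the usage pattern I mean, calling the tokenizer classes directly on a single sentence; the example sentence is illustrative only:

```python
# Sketch: word tokenization without going through nltk.word_tokenize /
# nltk.sent_tokenize, using the tokenizer classes directly.
from nltk.tokenize import NLTKWordTokenizer, TreebankWordTokenizer

sent = "Good muffins cost $3.88 in New York."

print(TreebankWordTokenizer().tokenize(sent))
print(NLTKWordTokenizer().tokenize(sent))
```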
@wannaphong There are no spaces between Thai words, only between sentences, am I right? I do not speak Thai, so I can't give examples here. But when Thai words are...
@wannaphong Is `clause_tokenize` required to get the correct detokenized string? If so, perhaps `clause_tokenize` could be called implicitly inside `word_detokenize`?
Since I do not speak Thai, I'm a bit confused about some points. The input of `word_detokenize` could be either a list of tokens (strings) or a list of sub-lists of strings...
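To make the question concrete, here is a sketch of the two call shapes I have in mind, assuming the `pythainlp.tokenize` API and an illustrative Thai sentence (I may be misreading the expected input):

```python
# Sketch: the two input shapes in question for word_detokenize.
from pythainlp.tokenize import word_detokenize, word_tokenize

# Shape 1: a flat list of tokens.
tokens = word_tokenize("ผมไปโรงเรียน")
print(word_detokenize(tokens))

# Shape 2: a list of sub-lists of tokens (e.g. per clause/sentence).
nested = [["ผม", "ไป"], ["โรงเรียน"]]
print(word_detokenize(nested))
```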
And how are English words (and words in other Indo-European languages) handled in `word_detokenize` (are spaces added between Thai and English words or not)?
There is a bug in the macOS version which should be fixed in the next release. You may use the Windows version for now.
Fixed in [2.3.0](https://github.com/BLKSerene/Wordless/releases/tag/2.3.0), please give it a try.
The font scaling issue on high-DPI screens will be gradually addressed in future versions.