
Word count is not accurate for non-English languages

Open liuhantang opened this issue 5 years ago • 5 comments

This is a great editor, and I really love it. However, the word count for non-English languages such as Chinese seems to be incorrect. For example, "加入中文字符" should be 6 words (or 6字 in Chinese), while the editor shows 4. Will there be a fix in the future?

liuhantang avatar Nov 11 '20 02:11 liuhantang

Please let me know your macOS version, your CotEditor version, and your language setting.

As for the code, CotEditor lets a macOS framework count words, assuming the text is written in the current system language. With my current setup at least (macOS 10.15.7, CotEditor 4.0.0, English), it counts "加入中文字符" as 3 words, namely 加入, 中文字, and 符. I also suspect that this is not wrong, though I'm not sure, since I'm not a native Chinese speaker.

[Screenshot: Screen Shot 2020-11-11 at 13 34 24]

IMO, counting words in languages that don't separate words with spaces is difficult for a computer in general. It would also help with debugging if you could give me examples of macOS apps that count Chinese words correctly.

1024jp avatar Nov 11 '20 04:11 1024jp

Thank you for your reply. Here are my settings: macOS version: Catalina 10.15.7; CotEditor version: 4.0.0; language setting: Simplified Chinese. Currently, CotEditor correctly counts the number of characters. The problem seems to be how a 'word' is defined. In my opinion, one single Chinese character should be treated as a 'word' (in the sense of English words). So "加入中文字符" should be "加 入 中 文 字 符", 6 words. macOS's word-counting framework seems to count phrases (which can consist of multiple characters) instead of words. It may treat '中文' as one word instead of 2. I tried to find out how it does the segmentation, yet the result seems somewhat random. The goal is actually simple: just count each Chinese character as a single word. As far as I know, Typora does this perfectly, even when you mix English and Chinese input.
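The counting rule requested here could be sketched roughly as follows. This is only an illustration in Python, not CotEditor's or Typora's actual code; the function name `count_words_per_character` is made up for this example, and the Han ranges cover only the CJK Unified Ideographs block and Extension A, which is an assumption.

```python
import re

# A token is either:
#   - one single Han character, or
#   - a run of word characters (letters, digits, underscore) with no Han in it.
# Assumption: only U+4E00..U+9FFF and U+3400..U+4DBF are treated as Han.
_HAN = "\u3400-\u4dbf\u4e00-\u9fff"
_TOKEN = re.compile(rf"[{_HAN}]|[^\W{_HAN}]+")


def count_words_per_character(text: str) -> int:
    """Count every Han character as one word and every other
    run of word characters as one word."""
    return len(_TOKEN.findall(text))


print(count_words_per_character("加入中文字符"))       # 6
print(count_words_per_character("Hello 世界 world"))  # 4
```

Punctuation and whitespace are skipped, so mixed English and Chinese text is handled in one pass. The sketch is deliberately naive: for instance, it splits "don't" at the apostrophe.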

liuhantang avatar Nov 11 '20 06:11 liuhantang

Thank you for your environment info and explanation. I'll look into how it can be improved. Please give me some time.

1024jp avatar Nov 11 '20 09:11 1024jp

To add another data point: Apple's Pages app seems to agree that this text is 6 words.

Apple frameworks follow Unicode segmentation rules, using the ICU library underneath, which does split this text into 3 "words". You can try it here. On that demo page, if you choose "line" instead (for places where the line can be wrapped), it appears to work mostly as desired, except that each empty line counts as a word.

(Edit: I don't know Chinese and have no opinion on what should be counted as a word; this is only a technical suggestion on how to achieve the requested result.)

michelf avatar Nov 11 '20 11:11 michelf

> Currently, CotEditor correctly counts the number of characters. The problem seems to be how a 'word' is defined. In my opinion, one single Chinese character should be treated as a 'word' (in the sense of English words). So "加入中文字符" should be "加 入 中 文 字 符", 6 words. macOS's word-counting framework seems to count phrases (which can consist of multiple characters) instead of words. It may treat '中文' as one word instead of 2. I tried to find out how it does the segmentation, yet the result seems somewhat random. The goal is actually simple: just count each Chinese character as a single word. As far as I know, Typora does this perfectly, even when you mix English and Chinese input.

It is perfectly legitimate to treat "加入中文字符" as four words or fewer, especially considering that some of these characters form compound words or phrases when used together.

So just to clarify, is the original request about counting individual characters? That is, the number of characters on screen, which would give you "4" for a word like "nice"? Or am I missing something?

I don't have anything against counting individual characters, but that's not counting words. If words were counted as requested, then words like "漂亮" or "舒服" would be counted as 2 words each (for a total of 4 "words"). Under normal circumstances, I think absolutely everyone would count these as 1 word each (so 2 words altogether); the same goes for Traditional Chinese. There are lots of words like that in Chinese: 说明,确定,内容… two characters but one word. Almost any country name, for example 日本,美国,加拿大, is a single word, and some are even longer: who would ever count Malaysia, 马来西亚, as 4 words? (edit: fixed numbers, I can't count up to 4)
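To make the difference concrete, here is a rough Python sketch that contrasts per-character counting with a toy dictionary-based segmenter (greedy forward maximum matching). The lexicon and the function name `segment_max_match` are hypothetical and exist only for this illustration; this is not how macOS or ICU actually segment text, and a real segmenter would need a much larger dictionary.

```python
# Toy lexicon (hypothetical), built from the examples in this comment.
LEXICON = {"漂亮", "舒服", "日本", "美国", "加拿大", "马来西亚"}


def segment_max_match(text, lexicon=LEXICON):
    """Greedy forward maximum matching: at each position take the longest
    lexicon entry that matches, falling back to a single character.
    Assumes the text contains only Chinese characters (no punctuation)."""
    max_len = max((len(word) for word in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; size 1 always succeeds.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


text = "漂亮舒服"
print(len(text))                     # 4 (one "word" per character)
print(len(segment_max_match(text)))  # 2 (漂亮, 舒服)
```

With this toy lexicon, 马来西亚 comes out as a single word, while per-character counting gives 4.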

inhalt120g avatar Dec 18 '23 07:12 inhalt120g