stroke-input-data
The stroke order for 里, both as an independent character and as a component of others, should be 2511211
The ArchChinese and zdic websites and my Chinese learning material all give the stroke order for 里 as 2511211, not 2511121 as in your data.
Interesting. Ending 里 in 211 is a cursive-based order.
Looks like this is a point of difference specific to the Mainland standard.
HK and TW standards end in 121:
- https://www.edbchinese.hk/lexlist_en/
- https://stroke-order.learningweb.moe.edu.tw/charactersQueryResult.do?words=里&lang=en
This affects on the order of 300 characters (based on a grep estimate, which probably contains false positives):
$ grep 2511121 codepoint-character-sequence.txt | wc -l
380
I read somewhere that the simplified versions of characters are based on cursive script, with some input from the traditional script. So it seems logical that they should use a cursive-based stroke order for some characters.
The way you have things set up at the moment, you have to edit codepoint-character-sequence.txt by hand, which is both onerous and error-prone. It might be an idea to generate it automatically, in such a way that you could edit the stroke order for 里 in one place. That way you could add the extra option very easily.
I don’t expect you to make the required changes to your excellent project because, as you say, you are not a programmer, and it would take a lot of work that only a nerd like me might be prepared to do. However, this also suggests a solution to your data loading problem. What I think I will do is clone your project and work on it. If I come up with anything useful (no promises!), I will feed it back to you.
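As a rough illustration of "edit the stroke order for 里 in one place", here is a minimal Python sketch. It is not the project's actual tooling: the `COMPONENT_STROKES` and `DECOMPOSITION` tables and the `sequences_for` helper are hypothetical names, and the two entries are only there to show the idea. Each component lists its allowed stroke sequences once, and every character built from it picks up all the variants.

```python
from itertools import product

# Hypothetical table: each component's allowed stroke sequences, defined once.
COMPONENT_STROKES = {
    "土": ["121"],
    "里": ["2511121", "2511211"],  # 121 ending and 211 ending
}

# Hypothetical decomposition table: character -> ordered list of components.
DECOMPOSITION = {
    "里": ["里"],
    "埋": ["土", "里"],
}


def sequences_for(character):
    """Return every stroke sequence obtainable by combining component variants."""
    options = [COMPONENT_STROKES[part] for part in DECOMPOSITION[character]]
    return ["".join(combo) for combo in product(*options)]


for character in DECOMPOSITION:
    print(character, sequences_for(character))
```

Adding or removing a variant for 里 then means changing one list, and the sequences for 埋 follow automatically. The hard part, which this sketch ignores, is building a reliable decomposition table in the first place.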
> So it seems logical that they should use a cursive-based stroke order for some characters.
Then they should have applied this consistently, and also made 王 etc. end in 211. (That's what the Japanese standard does.)
> It might be an idea to generate it automatically, in such a way that you could edit the stroke order for 里 in one place.
When I started on this project, I also considered whether it would be feasible to encode characters as combinations, and thereby eliminate manual repetition of data for components. I quickly ran into the problem of Unicode being a complete shitfest when it comes to consistency.
For example, consider the component U+5F55 录 (simplified) vs U+5F54 彔 (traditional). We have the following three categories:
- Characters with the simplified component: 剥录渌禄緑绿録𫘧
- Characters with the traditional component: 剝彔淥祿綠錄
- Characters with either component (what you see will depend on the font you have installed): 娽椂氯琭盝睩碌箓簶籙粶菉觮趢逯邍醁騄鵦龣㖨㟤㪖㫽㯟䃗䎑䎼䐂䘵䚄䟿䩮䰁䱚䴪
If things were consistent, you would either (a) only have categories 1 & 2, or (b) only have category 3. Having all three categories simultaneously is just stupid. Anyway, I concluded that any effort needed to resolve such discrepancies (if trying to encode characters as combinations) is better spent just manually building the data set.
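To make the three categories concrete, the following Python sketch prints the code points of a few characters taken from the lists above. The `code_point` helper and the choice of examples are illustrative only. The point is that a character in the third category has a single code point, so nothing in the encoding itself says which form of the component it should decompose into.

```python
def code_point(character: str) -> str:
    """Format a character's code point in the usual U+XXXX notation."""
    return f"U+{ord(character):04X}"


# A few characters from each of the three lists above.
examples = {
    "simplified component": "录禄",
    "traditional component": "彔祿",
    "either, depending on font": "碌",
}

for label, characters in examples.items():
    for character in characters:
        print(label, character, code_point(character))
```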
Up to this point, my error rate has been ~20/28k = 0.07% (mainly due to issue #2). There are probably more errors, but I do not expect the total rate to exceed 0.5%. I'm satisfied with that.
Now back to the original post, which is a systemic issue rather than a transcription error. 121 vs 211 also affects 黑. So overall we have at most 515 characters affected:
$ grep -P '25(1|43|\(1\|43\))1121' codepoint-character-sequence.txt | wc -l
515
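For anyone without `grep -P`, the same pattern can be tried in Python. This is only a sketch: `is_affected` is a made-up helper, and apart from 2511121 (里, from the original post) the sample strings are constructed to exercise each branch of the pattern rather than copied from the data file. The escaped branch `\(1\|43\)` matches the literal text `(1|43)`, which presumably covers lines that already store an alternation.

```python
import re

# Same pattern as the grep -P call: "25", then "1", "43", or the literal
# text "(1|43)", then "1121".
PATTERN = re.compile(r"25(1|43|\(1\|43\))1121")


def is_affected(sequence: str) -> bool:
    """Return True if the sequence contains the 121-ending pattern."""
    return PATTERN.search(sequence) is not None


samples = ["2511121", "25431121", "25(1|43)1121", "2511211"]
for sample in samples:
    print(sample, is_affected(sample))
```

The last sample ends in 211 and is there to show a sequence the pattern does not match.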
This is doable for me, but it's not very high priority. I'm happy to leave this issue open until I get around to fixing it some day.
Was struggling with 里 today as well, and came here to open an issue. Glad that this is already under discussion. Very interesting discussion btw, learned something again. Thank you both!
Hello @Greenactivist and @hubortje, sorry for the two-year delay!
I finally got around to allowing 211 in addition to 121 for 里, 黑, etc. (specifically, commits 6e1bcc205e283ee85ef4aa801c7077f3ba41efe7, ebedf6f8879dd8b5caf53c7a10eb6ebf1d70255f, and 0937f2c6e4f9f351e431e1223bba16eb75249d0f).
I will propagate the changes to the Android app later tonight. Expect the next release to be live on Google Play within a week and on F-Droid in about 2 weeks.
Let me know if I've missed a character or made a mistake somewhere.