PowerToys
[TextExtractor] fix error blanks in japanese OCR
Summary of the Pull Request
PR Checklist
- [x] Closes: #22208
- [ ] Communication: I've discussed this with core contributors already. If work hasn't been agreed, this work might be rejected
Detailed Description of the Pull Request / Additional comments
This PR points out a problem in PR #20926 and provides a patch.
The following submissions are relevant:
#20415 #20926
In PR #20926, the code adds a blank around every symbol that is not in the CJK Unified Ideographs block, but that block contains only a small part of the symbols that do not need blanks.
Kanji, Hiragana, Katakana, and Hankaku-Katakana (half-width Katakana) all need no blanks. The Korean OCR engine's behavior and rules differ from those for ZH and JP, so this PR only handles JP and ZH (combining PR #20926 and #20415).
There may be more frequently used symbols that do not need blanks, such as U+3001 and U+3002. Maybe this would be better solved by the OCR engine, or by rewriting the related modules?
Validation Steps Performed
Japanese
Original picture
Original (erroneous) result
そ の 後 、 2 人 の 住 む 関東 は 本格的 に 梅雨入 り し 、 雨 の 日 の 午前 た け の さ さ や か
な 交流 が は し ま る 。 タ カ オ は 心 に 秘 め て い た 靴職人 に な る 夢 を 語 り 、 あ る 理由
か ら 味覚障害 を 患 う ユ キ ノ は タ カ オ の 作 る 弁当 の 料理 に 味 を 感 し る よ う に な
る 。 ユ キ ノ は 弁当 の お 礼 に と タ カ オ に 「 靴作 り の 本 」 を プ レ セ ン ト し 、 タ カ オ
は ユ キ ノ の た め に 靴 を 作 る こ と に 決 め 、 お そ る お そ る 彼女 の 足 を 採寸 す る 。
results now (fixed)
その後 、 2 人の住む関東は本格的に梅雨入りし 、 雨の日の午前たけのささやか
な交流がはじまる 。 タカオは心に秘めていた靴職人になる夢を語り 、 ある理由
から味覚障害を患うユキノはタカオの作る弁当の料理に味を感しるようにな
る 。 ユキノは弁当のお礼にとタカオに 「 靴作りの本 」 をプレセントし 、 タカオ
はユキノのために靴を作ることに決め 、 おそるおそる彼女の足を採寸する 。
Chinese
Original picture
Result
故事发生的地点是在每干年回归一次的彗星造访过一个月之前 , 日本飞驊市的乡下小
町糸守町 。 在这里女高中生三叶每天都过着忧郁的生活 , 而她烦恼的不光有担任町长的父
亲所举行的选举运动 , 还有家传神社的古老习俗 。 在这个小小的町 , 周围都只是些爱瞎操
心的老人 。 为此三叶对于大都市充满了憧憬 。
English
Original picture
Result
FoIIowing summer break, Takao returns tO school and spots Yukari, discovering that she is a literature teacher and
had been the target Of gossip and bullying. TO avoid further confrontation, Yukari opted tO avoid work and retreat tO
the park, hoping she would learn tO overcome her fears. However, she quits her jOb and leaves the school. That
afternoon, Takao meets Yukari at the park and greets her by reciting the 2 , 5 14th poem from the ManIyöshü
Japanese poetry collection, the correct response tO her tanka, which he found in a classic Japanese literature
textbook. Aft e r getting soaked by a sudden thunderstorm, bOth head tO YukariIs apartment and spend the afternoon
together. When Takao confesses his love, Yukari is moved, but reminds him that she is a teacher and that she is
moving back tO her hometown on Shikoku. After Ta kao excuses himself, Yukari realizes her mistake and runs after
him. StiII upset, Ta kao angrily takes back what he had said and criticizes her fO r being SO secretive and never
opening up tO him. Yukari embraces him and the tWO cry while she explains that their time together in the park had
saved her.
Unicode U+3000 to U+303F is the CJK Symbols and Punctuation block (IsCJKSymbolsandPunctuation). In Chinese and Japanese these symbols take no blanks, but if we want to include all symbols, it becomes pretty difficult to judge (e.g. https://github.com/TheJoeFin/Text-Grab/issues/191 is still not perfect and has some mistakes). So this PR just adds the most important ones: Hiragana, Katakana, and Hankaku-Katakana.
In ZH and JP, symbols like "、", ",", and "。" are still surrounded by two blanks, because the regex pattern string doesn't include them.
The following pattern would also be OK:
var cjkRegex = new Regex(@"\p{IsCJKUnifiedIdeographs}|\p{IsHiragana}|\p{IsKatakana}|[\uFF61-\uFF9F]|[\u3000-\u3003]");
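To make the difference between the two patterns concrete, here is a minimal sketch, not the actual TextExtractor code. It assumes the PR's pattern is the one above without the final `[\u3000-\u3003]` alternative, and it assumes a simple joining rule: put a blank between two neighbouring OCR words unless both match the pattern. The class `OcrSpacingSketch`, the helper `JoinWords`, and the field names are hypothetical names used only for illustration.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class OcrSpacingSketch
{
    // Assumed PR pattern: ideographs, Hiragana, Katakana, half-width Katakana.
    public static readonly Regex PrPattern = new Regex(
        @"\p{IsCJKUnifiedIdeographs}|\p{IsHiragana}|\p{IsKatakana}|[\uFF61-\uFF9F]");

    // The "also OK" variant: additionally covers U+3000..U+3003
    // (ideographic space, "、", "。", "〃").
    public static readonly Regex WithPunctuation = new Regex(
        @"\p{IsCJKUnifiedIdeographs}|\p{IsHiragana}|\p{IsKatakana}|[\uFF61-\uFF9F]|[\u3000-\u3003]");

    // Hypothetical helper: joins OCR words and inserts a blank between two
    // neighbours unless BOTH of them match the no-blank pattern.
    public static string JoinWords(IEnumerable<string> words, Regex noBlankPattern)
    {
        var text = new StringBuilder();
        bool hasPrevious = false;
        bool previousMatched = false;

        foreach (string word in words)
        {
            bool matched = noBlankPattern.IsMatch(word);

            if (hasPrevious && !(matched && previousMatched))
            {
                text.Append(' ');
            }

            text.Append(word);
            previousMatched = matched;
            hasPrevious = true;
        }

        return text.ToString();
    }
}
```

Under these assumptions, joining the words `そ の 後 、 2 人 の 住 む` with `PrPattern` gives `その後 、 2 人の住む`, which matches the fixed output shown earlier. With `WithPunctuation` the blank before "、" disappears (`その後、 2 人の住む`), while the blanks around the ASCII digit remain because "2" matches neither pattern. English words never match, so they stay separated by single blanks.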
@check-spelling-bot Report
:red_circle: Please review
See the :open_file_folder: files view or the :scroll: action log for details.
Unrecognized words (2)
Hankaku symble
Previously acknowledged words that are now absent
brucelindbloom chromaticities companding Eqn ffaa
:arrow_right: To accept :heavy_check_mark: these unrecognized words as correct and remove the previously acknowledged and now absent words, run the following commands
... in a clone of the git@github.com:AO2233/PowerToys.git repository on the main branch (:information_source: how do I use this?):
curl -s -S -L 'https://raw.githubusercontent.com/check-spelling/check-spelling/v0.0.21/apply.pl' |
perl - 'https://github.com/microsoft/PowerToys/actions/runs/3612769556/attempts/1'
Available :books: dictionaries could cover words not in the :blue_book: dictionary
This includes both expected items (2140) from .github/actions/spell-check/expect.txt and unrecognized words (2)
Dictionary | Entries | Covers |
---|---|---|
cspell:cpp/src/cpp.txt | 30216 | 121 |
cspell:win32/src/win32.txt | 53509 | 116 |
cspell:python/src/python/python-lib.txt | 3873 | 31 |
cspell:php/php.txt | 2597 | 16 |
cspell:node/node.txt | 1768 | 14 |
cspell:typescript/typescript.txt | 1211 | 12 |
cspell:python/src/python/python.txt | 453 | 10 |
cspell:java/java.txt | 7642 | 10 |
cspell:aws/aws.txt | 218 | 8 |
cspell:r/src/r.txt | 808 | 7 |
Consider adding them using (in .github/workflows/spelling2.yml):
with:
  extra_dictionaries:
    cspell:cpp/src/cpp.txt
    cspell:win32/src/win32.txt
    cspell:python/src/python/python-lib.txt
    cspell:php/php.txt
    cspell:node/node.txt
    cspell:typescript/typescript.txt
    cspell:python/src/python/python.txt
    cspell:java/java.txt
    cspell:aws/aws.txt
    cspell:r/src/r.txt
To stop checking additional dictionaries, add:
with:
  check_extra_dictionaries: ''
If the flagged items are :exploding_head: false positives
If items relate to a ...
- binary file (or some other file you wouldn't want to check at all).
  Please add a file path to the excludes.txt file matching the containing file.
  File paths are Perl 5 Regular Expressions - you can test yours before committing to verify it will match your files.
  ^ refers to the file's path from the root of the repository, so ^README\.md$ would exclude README.md (on whichever branch you're using).
- well-formed pattern.
  If you can write a pattern that would match it, try adding it to the patterns.txt file.
  Patterns are Perl 5 Regular Expressions - you can test yours before committing to verify it will match your lines.
  Note that patterns can't match multiline strings.
The correct spelling of 'symble' is 'symbol', sorry for the misspelling! The code comments are there for ease of inspection; you could remove them. The word 'hankaku-katakana' means 'half-width kana' ('半角カナ' or '半角假名'). Here is the wiki URL.
fixed spelling mistakes in your branch
well, 1 spelling and added Hankaku to exception list
@microsoft-github-policy-service agree
/azp run
Azure Pipelines successfully started running 1 pipeline(s).