takuya kodama comments

Results 124 comments of takuya kodama

NormalizerNFKC: add an option to remove diacritical mark

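I see, thank you! As a use case, most characters that carry diacritical marks are indeed likely to be alphabet letters, so I'd like to proceed with that approach for now.

---

In that case the precomposed characters involved will mainly be Latin characters, so I plan to split the task into one piece per section under "Latin" on the following site.

- https://www.unicode.org/charts/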

NormalizerNFKC: add an option to remove diacritical mark

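There are 489 characters whose base character is an alphabet letter once the diacritical marks are removed.

```ruby
# True if the NFD form of the character contains a code point from the
# Combining Diacritical Marks block (U+0300..U+036F).
def have_diactritical_combining_character?(character)
  code_points = character.unicode_normalize(:nfd).codepoints
  code_points.any? do |code_point|
    (0x0300..0x036f).cover?(code_point)
  end
end

# True if the first code point of the NFD form is an ASCII letter (A-Z or a-z).
def base_character_is_alphabet?(character)
  base_character = character.unicode_normalize(:nfd).chars.first
  base_code_point = base_character.codepoints.first
  (0x0041..0x005A).cover?(base_code_point) ||
    (0x0061..0x007A).cover?(base_code_point)
end

total_count = 0...
```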

NormalizerNFKC: add an option to remove diacritical mark

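I'll list the Latin-script chart specifications that contain the corresponding precomposed characters, then add the implementation and tests one specification at a time.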

NormalizerNFKC: add an option to remove diacritical mark

Based on the output of the sample code above, supporting the precomposed characters in the following blocks should cover the alphabet letters that carry a diacritical mark.

- [Latin-1 Supplement](https://www.unicode.org/charts/PDF/U0080.pdf)
- [Latin Extended-A](https://www.unicode.org/charts/PDF/U0100.pdf)
- [Latin Extended-B](https://www.unicode.org/charts/PDF/U0180.pdf)
- [Latin Extended-C](https://www.unicode.org/charts/PDF/U2C60.pdf)
- [Latin Extended Additional](https://www.unicode.org/charts/PDF/U1E00.pdf)
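
A per-block breakdown makes it easier to see how the matching characters are distributed over these charts. The following is only a minimal sketch, not the script that was actually run: the names `LATIN_BLOCKS`, `ascii_letter_with_diacritical_mark?` and `count_per_block` are hypothetical, the block ranges are taken from the Unicode charts linked above, and the counts depend on the Unicode version bundled with the Ruby in use.

```ruby
# Hypothetical helper names; the block ranges come from the Unicode charts.
LATIN_BLOCKS = {
  "Latin-1 Supplement"        => 0x0080..0x00FF,
  "Latin Extended-A"          => 0x0100..0x017F,
  "Latin Extended-B"          => 0x0180..0x024F,
  "Latin Extended Additional" => 0x1E00..0x1EFF,
  "Latin Extended-C"          => 0x2C60..0x2C7F,
}.freeze

# Combining Diacritical Marks block.
COMBINING_DIACRITICAL_MARKS = 0x0300..0x036F

# True if the code point canonically decomposes (NFD) into an ASCII letter
# followed by at least one mark from the Combining Diacritical Marks block.
def ascii_letter_with_diacritical_mark?(code_point)
  decomposed = [code_point].pack("U").unicode_normalize(:nfd).codepoints
  base, *marks = decomposed
  ascii_letter = (0x41..0x5A).cover?(base) || (0x61..0x7A).cover?(base)
  ascii_letter && marks.any? { |mark| COMBINING_DIACRITICAL_MARKS.cover?(mark) }
end

# Number of matching code points in each block, keyed by block name.
def count_per_block
  LATIN_BLOCKS.transform_values do |range|
    range.count { |code_point| ascii_letter_with_diacritical_mark?(code_point) }
  end
end

count_per_block.each do |name, count|
  puts "#{name}: #{count}"
end
```

Characters such as `Æ` or `Ø` have no canonical decomposition, so this check skips them even though they sit in the same blocks.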

NormalizerNFKC: add an option to remove diacritical mark

I've just updated the plan as follows.

## Implementation plans

- [x] Add `unify_alphabet_diacritical_mark` option to `NormalizerNFKC`
  - We won't implement the logic but just the option interface.
- [...

NormalizerNFKC: add an option to remove diacritical mark

> Add unify_alphabet_diacritical_mark option to NormalizerNFKC
> - We won't implement the logic but just the option interface.

I will start implementing this one now.

NormalizerNFKC: add an option to remove diacritical mark

I'll add a test to the following pull request to confirm that the `unify_alphabet_diacritical_mark` option can be specified without problems.

- https://github.com/groonga/groonga/pull/1831

NormalizerNFKC: add an option to remove diacritical mark

```console
$ /tmp/local/bin/groonga --version
Groonga 14.0.6-7-g03e3053 [Linux,x86_64,utf8,match-escalation-threshold=0,nfkc,mecab,message-pack,onigmo,zlib,lz4,zstandard,epoll,rapidjson,xxhash]
$ grntest ./test/command/suite/normalizers/nfkc/unify_alphabet_diacritical_mark.test --groonga=/tmp/local/bin/groonga
```

## Actual result

This change only makes the option specifiable and is not meant to change the normalized value, so I'll confirm that here.

```
normalize   'NormalizerNFKC("unify_alphabet_diacritical_mark", true)'   "À"   WITH_TYPES
[
  [
    0,
    0.0,
    0.0
  ],
  {
    "normalized": "\u0001\u0000",...
```

NormalizerNFKC: add an option to remove diacritical mark

I'll address the review comments.

NormalizerNFKC: add an option to remove diacritical mark

> Implement the logic of unify_alphabet_diacritical_mark with tests and documents for the following precomposed characters.

I'll work on the following!

- [ ] [Latin-1 Supplement](https://www.unicode.org/charts/PDF/U0080.pdf)
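
To make the intended mapping concrete, here is a rough Ruby sketch of what removing the diacritical mark from a precomposed alphabet character could look like. It is not Groonga's implementation: the function name `strip_alphabet_diacritical_marks` is hypothetical, it assumes the option is meant to map each such precomposed character to its ASCII base letter, and it ignores everything else `NormalizerNFKC` does.

```ruby
# Combining Diacritical Marks block.
DIACRITICAL_MARK_RANGE = 0x0300..0x036F

# Hypothetical helper: replaces each precomposed character whose canonical
# decomposition (NFD) is an ASCII letter plus combining diacritical marks
# with that ASCII letter. All other characters are returned unchanged.
def strip_alphabet_diacritical_marks(text)
  text.each_char.map do |character|
    base, *marks = character.unicode_normalize(:nfd).codepoints
    ascii_letter = (0x41..0x5A).cover?(base) || (0x61..0x7A).cover?(base)
    only_diacritics = marks.any? &&
                      marks.all? { |mark| DIACRITICAL_MARK_RANGE.cover?(mark) }
    if ascii_letter && only_diacritics
      [base].pack("U")
    else
      character
    end
  end.join
end

puts strip_alphabet_diacritical_marks("ÀÉîõü Æ ø")
```

The sketch handles precomposed characters only, matching the scope of this task; input that is already decomposed (a base letter followed by a separate combining mark) passes through unchanged.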