Add new token filters for Japanese sutegana (捨て仮名)

Open daixque opened this issue 1 year ago • 10 comments

Description

Sutegana (捨て仮名) are the small forms of hiragana and katakana letters in Japanese. Unlike modern text, old Japanese text does not use sutegana. For example:

  • "ストップウォッチ" is written as "ストツプウオツチ"
  • "ちょっとまって" is written as "ちよつとまつて"

So it is useful to normalize sutegana to normal (uppercase) characters when searching a corpus that includes old Japanese text, such as patents, legal documents, and contract policies.

This pull request introduces 2 token filters:

  • JapaneseHiraganaUppercaseFilter for hiragana
  • JapaneseKatakanaUppercaseFilter for katakana

so that users can apply either one separately. Each filter converts all sutegana (small characters) in a token into normal kana (uppercase characters) to normalize the token.
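The mapping itself can be illustrated without any Lucene dependency. The following is only a minimal sketch of the character substitution, not the code in this pull request: the class name `SuteganaSketch`, its method names, and the choice to cover only the ten basic small letters per script are assumptions made for illustration. The real filters operate on the token's term buffer inside a Lucene `TokenFilter`.

```java
// Hypothetical, self-contained sketch of sutegana normalization.
// It is NOT the Lucene filter implementation from this pull request.
public final class SuteganaSketch {
    // Basic small hiragana and their normal-size counterparts (same order).
    private static final String SMALL_HIRAGANA = "ぁぃぅぇぉっゃゅょゎ";
    private static final String NORMAL_HIRAGANA = "あいうえおつやゆよわ";

    // Basic small katakana and their normal-size counterparts (same order).
    private static final String SMALL_KATAKANA = "ァィゥェォッャュョヮ";
    private static final String NORMAL_KATAKANA = "アイウエオツヤユヨワ";

    private SuteganaSketch() {}

    // Replaces each character found in `small` with the character at the
    // same index in `normal`; all other characters are left unchanged.
    private static String replaceSmall(String token, String small, String normal) {
        char[] buf = token.toCharArray();
        for (int i = 0; i < buf.length; i++) {
            int idx = small.indexOf(buf[i]);
            if (idx >= 0) {
                buf[i] = normal.charAt(idx);
            }
        }
        return new String(buf);
    }

    // Normalizes small hiragana only, mirroring the idea of a hiragana-only filter.
    public static String uppercaseHiragana(String token) {
        return replaceSmall(token, SMALL_HIRAGANA, NORMAL_HIRAGANA);
    }

    // Normalizes small katakana only, mirroring the idea of a katakana-only filter.
    public static String uppercaseKatakana(String token) {
        return replaceSmall(token, SMALL_KATAKANA, NORMAL_KATAKANA);
    }

    public static void main(String[] args) {
        System.out.println(uppercaseKatakana("ストップウォッチ")); // ストツプウオツチ
        System.out.println(uppercaseHiragana("ちょっとまって"));   // ちよつとまつて
    }
}
```

Keeping the two scripts in separate methods reflects the intent stated above that either normalization can be used on its own.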

Why it is needed

This transformation must be done as a token filter. MappingCharFilter already exists, but applying that character filter to normalize sutegana would change the text before tokenization and affect how it is tokenized, which is not desired.

daixque avatar Dec 11 '23 14:12 daixque

From a Japanese perspective, the necessity sounds reasonable. Thank you for the contribution!

kojisekig avatar Dec 12 '23 00:12 kojisekig

Hi @mikemccand and @kojisekig, thank you for your reviews. I updated the code in line with the comments and added lines to module-info and resources to make gradle check green.

daixque avatar Dec 12 '23 01:12 daixque

Besides the optimization of manipulating the internal byte[] directly, I think this is good to go.

dungba88 avatar Dec 13 '23 11:12 dungba88

I refactored the Katakana filter to apply the same kind of enhancement to it as well.

daixque avatar Dec 16 '23 02:12 daixque

Looks great @daixque -- would you like to add a lucene/CHANGES.txt entry describing this awesome new capability? Be sure to put it under the 9.10.0 section since we can backport this change (it is not a 10.0.0-only feature).

mikemccand avatar Dec 18 '23 14:12 mikemccand

@mikemccand This is done. Thanks!

daixque avatar Dec 18 '23 14:12 daixque

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

github-actions[bot] avatar Jan 08 '24 12:01 github-actions[bot]

@mikemccand @dungba88 Just a ping. Is there anything else I need to do for this PR? If not, could you merge it or let me know when it will be merged?

daixque avatar Jan 09 '24 08:01 daixque

I think it's good to go, but I don't have merge permission. Mike should be able to help you; otherwise you can try notifying the dev mailing list, as suggested by the bot.

dungba88 avatar Jan 09 '24 10:01 dungba88

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

github-actions[bot] avatar Jan 24 '24 00:01 github-actions[bot]