lucene
Add new token filters for Japanese sutegana (捨て仮名)
Description
Sutegana (捨て仮名) are the small forms of hiragana and katakana letters in Japanese. Unlike modern text, old Japanese text does not use sutegana (捨て仮名). For example:
- "ストップウォッチ" is written as "ストツプウオツチ"
- "ちょっとまって" is written as "ちよつとまつて"
So it is useful to normalize sutegana to normal (uppercase) characters when searching a corpus that includes old Japanese text, such as patents, legal documents, and contract policies.
This pull request introduces 2 token filters:
- JapaneseHiraganaUppercaseFilter for hiragana
- JapaneseKatakanaUppercaseFilter for katakana
so that users can apply either one separately. Each filter converts every sutegana (small character) in a token into the corresponding normal kana (uppercase character) to normalize the token.
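As a rough illustration of the mapping each filter performs, here is a minimal sketch in plain Java. It is not the code from this pull request: the class name `SuteganaSketch`, its method names, and the lookup strings are hypothetical, and the tables cover only the common small kana. It rewrites a token's character buffer in place, which is roughly what a token filter would do with the term buffer.

```java
// Hypothetical sketch, not the PR's implementation.
class SuteganaSketch {
    // Small kana and their regular-size counterparts, aligned by index.
    // Assumption: only the common small kana are listed here.
    private static final String SMALL_HIRAGANA = "ぁぃぅぇぉっゃゅょゎゕゖ";
    private static final String LARGE_HIRAGANA = "あいうえおつやゆよわかけ";
    private static final String SMALL_KATAKANA = "ァィゥェォッャュョヮヵヶ";
    private static final String LARGE_KATAKANA = "アイウエオツヤユヨワカケ";

    // Replace each small kana found in buf[0..len) with its regular-size form.
    static void upperInPlace(char[] buf, int len, String small, String large) {
        for (int i = 0; i < len; i++) {
            int idx = small.indexOf(buf[i]);
            if (idx >= 0) {
                buf[i] = large.charAt(idx);
            }
        }
    }

    // Normalizes small hiragana only; katakana is left untouched.
    static String hiraganaUppercase(String token) {
        char[] buf = token.toCharArray();
        upperInPlace(buf, buf.length, SMALL_HIRAGANA, LARGE_HIRAGANA);
        return new String(buf);
    }

    // Normalizes small katakana only; hiragana is left untouched.
    static String katakanaUppercase(String token) {
        char[] buf = token.toCharArray();
        upperInPlace(buf, buf.length, SMALL_KATAKANA, LARGE_KATAKANA);
        return new String(buf);
    }
}
```

With this sketch, `hiraganaUppercase("ちょっとまって")` gives "ちよつとまつて" and `katakanaUppercase("ストップウォッチ")` gives "ストツプウオツチ", matching the examples in the description.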
Why it is needed
This transformation must be done as a token filter. MappingCharFilter already exists, but applying it as a character filter to normalize sutegana would change the input before tokenization and therefore affect the tokenization result, which is not desired.
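One way to picture the concern is with a toy dictionary-driven segmenter. This is only a sketch under strong assumptions: the two-word dictionary, the longest-match `tokenize` method, and the class name `FilterOrderSketch` are all hypothetical and far simpler than Lucene's Japanese tokenizer. It compares normalizing before segmentation (char filter style) with normalizing after segmentation (token filter style).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical toy example; the real tokenizer works very differently.
class FilterOrderSketch {
    // Assumption: the dictionary only knows modern spellings.
    static final Set<String> DICTIONARY = Set.of("ちょっと", "まって");

    // Convert small hiragana to regular-size hiragana.
    static String upper(String s) {
        String small = "ぁぃぅぇぉっゃゅょゎゕゖ";
        String large = "あいうえおつやゆよわかけ";
        StringBuilder sb = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            int i = small.indexOf(c);
            sb.append(i >= 0 ? large.charAt(i) : c);
        }
        return sb.toString();
    }

    // Greedy longest-match segmentation; unknown text falls back to single characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = pos + 1;
            for (int e = text.length(); e > pos + 1; e--) {
                if (DICTIONARY.contains(text.substring(pos, e))) {
                    end = e;
                    break;
                }
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    // Char filter style: normalize first, then segment the altered text.
    static List<String> charFilterStyle(String text) {
        return tokenize(upper(text));
    }

    // Token filter style: segment the original text, then normalize each token.
    static List<String> tokenFilterStyle(String text) {
        List<String> out = new ArrayList<>();
        for (String token : tokenize(text)) {
            out.add(upper(token));
        }
        return out;
    }
}
```

In this toy setup, the token filter style keeps the word boundaries found in the original text, while the char filter style hands the segmenter text that no longer matches any dictionary entry, so it degrades to single-character tokens.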
From a Japanese perspective, the necessity sounds reasonable. Thank you for the contribution!
Hi @mikemccand and @kojisekig, thank you for your reviews.
I updated the code according to the review comments and added entries to module-info and the resources to make gradle check green.
Besides the optimization of manipulating the internal byte[] directly, I think this is good to go.
I refactored the code to apply the same kind of enhancement to the Katakana filter as well.
Looks great @daixque -- would you like to add a lucene/CHANGES.txt entry describing this awesome new capability? Be sure to put it under the 9.10.0 section since we can backport this change (it is not a 10.0.0-only feature).
@mikemccand This is done. Thanks!
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!
@mikemccand @dungba88 Just a ping. Is there anything left for me to do on this PR? If not, could you merge it or let me know when it will be merged?
I think it's good to go, but I don't have merge permission. Mike should be able to help you; otherwise, you can try notifying the dev mailing list as suggested by the bot.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!