Remove Telugu normalization of vu వు to ma మ from IndicNormalizer
Description
Telugu vu వు and ma మ are visually similar—akin to English "rn" and "m"—but they should not be conflated. Names like వెంకటరామ (Venkatarama) and వెంకటరావు (Venkatarao) and words like మండే and వుండే (links to Telugu Wiktionary) are distinct.
It's like conflating "rn" and "m" to merge burn/bum and corn/com. It could happen when reading quickly or with poor handwriting, but it is not something that should happen for search indexing.
I notice that some of the Telugu elements of IndicNormalizer are in TeluguNormalizer, but this mapping is not—which is good!
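For concreteness, here's a small, unofficial demo of the folding at the IndicNormalizer level (the harness and class name are mine; it assumes Lucene's analysis-common module is on the classpath):

```java
import org.apache.lucene.analysis.in.IndicNormalizer;

// Minimal demo: run the two distinct words from this report through
// IndicNormalizer and print what they normalize to. Per this report, the
// వు -> మ folding makes వుండే and మండే collide after normalization.
public class VuMaFoldingDemo {
  public static void main(String[] args) {
    IndicNormalizer normalizer = new IndicNormalizer();
    for (String word : new String[] {"వుండే", "మండే"}) {
      char[] buffer = word.toCharArray();
      int newLength = normalizer.normalize(buffer, buffer.length);
      System.out.println(word + " -> " + new String(buffer, 0, newLength));
    }
  }
}
```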
(Sorry for the botched pull request. Obviously this change would also affect some tests, which need to be updated or re-evaluated.)
Version and environment details
My version: "distribution" : "opensearch", "number" : "1.3.20", "lucene_version : "8.10.1"
Running on x86_64 GNU/Linux in Docker 4.15.0 on MacOS 13.6.3.
> It's like conflating "rn" and "m" to merge burn/bum and corn/com. It could happen when reading quickly or with poor handwriting, but it is not something that should happen for search indexing.
If you read the referenced documents, these mappings are specifically for this exact purpose. It solves technical issues of graphical vs logical order with fonts. It sounds like you don't want this: if you have perfect unicode text from wikipedia that doesn't suffer from such damage, don't use this filter as you will find more mappings you don't like.
The problems dealt with by the filter happen most often with text written in legacy fonts, extracted from PDF, etc, etc. In such cases, the foldings are essential: the improvements can be seen (and measured) in FIRE IR benchmarks.
The filter here is working as documented: problem is, user didn't read the documentation. Just don't use the filter if you don't want the transformations that it does.
@praveen-d291: Thanks for the pull request! I was unsure how best to modify the tests since I don't read Telugu. I couldn't tell what would make natural-looking examples and I didn't want to further impose on the native speaker I have been working with to look at unit tests, so thank you for putting your overlapping knowledge of Telugu and Java to good use for others. I hope someone approves it!
@rmuir: You have used "working as documented" as a thought-terminating cliché before. Just because something is accurately documented doesn't mean it is the right thing to do.
The referenced document is also no longer at the URL given in the code—and, based on the Wayback Machine, hasn't been for years—which will keep many people from finding and referencing it. However, I did find the new location of the current version (also thanks to the Wayback Machine):
http://languagelog.ldc.upenn.edu/myl/ldc/IndianScriptsUnicode.html
A few thoughts:
- Is text written in legacy fonts, extracted from PDF, etc. the most common use case for Telugu text indexed by Lucene these days? I get that the specific mapping improves recall for poorly curated text, but it does so at the cost of precision. Neither Praveen above nor the native speaker I've been working with seems to think this one mapping is useful. I originally questioned it because I—as a moderately attentive non-speaker—can see the difference between the characters in all but one of the Telugu-capable fonts I have, and in all of the Telugu-specific fonts I have—Arial Unicode being the one where it is visually ambiguous. That's very different from other mappings like బ + ు + ు (బుు) → ఋ, where the non-canonical sequence is visually indistinguishable from the result and makes no sense on its own ("buu" should be బ + ూ = బూ).
- According to the referenced document's History section, it hasn't been updated since 1998. Technology moves fast, so it seems reasonable to review data sources and assumptions about content at least once every quarter century.
- Also, if you read the referenced document carefully, the relevant mapping seems to be included only for some expansive notion of completeness: it's parenthesized, unlike any other mapping, and the associated comment in the referenced document explicitly says that "MA [0C2E] will not be confused" with VU [0C35+0C41] because there is special rendering to make them distinct (modulo Arial Unicode). You should be able to see the referenced difference in rendering in the title of this ticket.
(U+0C2E 0C35 0C41 TELUGU LETTER MA will not be confused,
as the script uses a special rendering
of 0C41 in this case. The same is
done in several other appearant cases.)
I read that as indicating that including the VU/MA mapping is an error. There may be other cases, as the comment suggests, but this one is high-frequency enough that it bubbled to the top in my analysis of our content, across several languages.
"Don't use it if you don't like it" is another thought-terminating cliché you've used before. I have the wherewithal to do exactly that, but that's not why I'm here. I'm lucky to have the time and ability to do a detailed analysis of the effects of the components of various language analyzers on our content, which is often varied and voluminous, and I can usually find willing native speakers to help me untangle the more questionable or confusing bits. I can also write my own plugins, configure custom filters, and tweak anything and everything I need to. Not every organization using Lucene has the ability to do that, so I try to upstream generally applicable knowledge or improvements for the users who don't have the time, technical skill, and access to the language knowledge needed to customize their own deployments.
Even though I can, forking and/or re-implementing the 99+% of indic_normalization that does good things is a brittle approach that cuts me off from future improvements and upgrades, and adds an unneeded maintenance burden to my deployment. I'd rather try to improve indic_normalization for everyone, or at least have a conversation about current vs historical trends in computing and content for the relevant language/script,[*] think about the trade-offs of recall and precision for the particular mapping, incorporate thoughts and insights from speakers of the language, and improve everyone's understanding of the current needs and wants of searchers and readers.
[*] I'd appreciate a link to the FIRE IR benchmarks you had in mind. A quick online search only revealed discussion of a Telugu Named Entity Recognition dataset.
Instead, user received another abrasive and dismissive termination of discussion, which reminded user why user has not always shared[†] other generally applicable knowledge or ideas for improvements.
[†] For example, user has previously noted that
bengali_normalization uses a phonetic algorithm with much too aggressive compression for search, and verified this fact with native Bangla speakers. User recognizes that it works as documented, and since user did not like it, user does not use it—as user would expect to be advised. However, user felt bad for not trying to upstream this information to improve Bangla search for others. Now user feels less bad because the attempt would also likely have been rejected.
Hey @rmuir ,
Thanks for the explanation. I've been thinking about the TeluguAnalyzer's default behavior, and I believe we have a significant hidden issue. The analyzer bundles IndicNormalizationFilter, which implicitly converts వు -> మ. This conflation isn't documented anywhere within TeluguAnalyzer, so users won't realize it's happening. Even the link in IndicNormalizer.java (http://ldc.upenn.edu/myl/ldc/IndianScriptsUnicode.html) is no longer accessible. This behavior is going to confuse any native speaker.
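To make the hidden behavior concrete, here's a small, unofficial check (assuming a recent Lucene where org.apache.lucene.analysis.te.TeluguAnalyzer exists; the harness and class name are mine). The exact terms it prints also depend on stop words and stemming; the point is whether two distinct words end up as the same indexed term:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.te.TeluguAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Print the terms the stock TeluguAnalyzer emits for two distinct words.
// If the script-level folding applies as described in this thread,
// వుండే and మండే collapse to the same term.
public class TeluguAnalyzerCollisionCheck {
  public static void main(String[] args) throws Exception {
    try (Analyzer analyzer = new TeluguAnalyzer()) {
      for (String word : new String[] {"వుండే", "మండే"}) {
        try (TokenStream stream = analyzer.tokenStream("body", word)) {
          CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
          stream.reset();
          while (stream.incrementToken()) {
            System.out.println(word + " -> " + term.toString());
          }
          stream.end();
        }
      }
    }
  }
}
```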
The bundled TeluguAnalyzer assumes that most users are going to run Lucene on text written in legacy fonts, extracted from PDF, etc. But that might not be true anymore, given how well Telugu is supported in Unicode today. I have two options in mind:
Option 1: Fix the Default (My Preference). I'd propose adding a boolean option to the TeluguAnalyzer constructor that controls whether IndicNormalizationFilter is included, and making its default false. This would make TeluguAnalyzer precise right out of the box for modern documents, while users with older, less well-formatted text could still explicitly enable it. I believe this is a necessary correction for linguistic accuracy, and we should explicitly document the conversion either way. A rough sketch follows after the two options.
Option 2: Document the behavior in TeluguAnalyzer. Alternatively, we could document this specific behavior in the TeluguAnalyzer docs, explaining the వు to మ mapping and how to build a custom analyzer to avoid it.
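To make Option 1 concrete, here's a rough sketch of the shape I have in mind—deliberately simplified, not the actual TeluguAnalyzer source, with stop words, keyword marking, and stemming omitted, and assuming the te module's TeluguNormalizationFilter for the language-specific step:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.in.IndicNormalizationFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.te.TeluguNormalizationFilter;

// Simplified sketch of the proposed opt-in flag, not the real TeluguAnalyzer
// (stop words, keyword marking, and stemming are omitted for brevity).
public final class SketchTeluguAnalyzer extends Analyzer {
  private final boolean applyIndicNormalization;

  public SketchTeluguAnalyzer() {
    this(false); // proposed default: no script-wide Indic foldings
  }

  public SketchTeluguAnalyzer(boolean applyIndicNormalization) {
    this.applyIndicNormalization = applyIndicNormalization;
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    TokenStream result = new LowerCaseFilter(source);
    if (applyIndicNormalization) {
      // opt-in: script-wide foldings, including వు -> మ
      result = new IndicNormalizationFilter(result);
    }
    // the Telugu-specific normalization stays in the default chain
    result = new TeluguNormalizationFilter(result);
    return new TokenStreamComponents(source, result);
  }
}
```

The real change would of course go into the existing createComponents chain; the sketch only shows where the flag would sit and what the proposed default would be.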
Option 1 feels like the right long-term fix for the default user experience. What do you think? I can raise a PR once we agree on an approach.
As a native Telugu speaker who loves Lucene, I'm keen to help out!
cc @Trey314159
If you re-read the description, I think you'll understand why I responded the way I did. To me it reads as though there isn't an understanding of the purpose of this filter, or of the reasons why text could have these problems. It comes across as "I'm a linguist, I'm a native speaker, these characters are different, this is wrong!!!" without any actual data/homework done, and it leaves all the "homework" to the maintainers.
This isn't meant as an attack on you, don't take it the wrong way, I'm just stating how I read it. There was never a shortage of native speakers here, instead a shortage of correct unicode :)
If the issue were written differently (this is just an EXAMPLE), it would allow making progress and improvements without draining a lot of time:
EXAMPLE:
Computerized text in this language has advanced in the last decade: e.g. content is now generally in (correct) Unicode, you don't have to download custom fonts from websites to read the text, nor are they rendering text as images/PDF, nor are they doing janky conversion from 8-bit fonts. Client OS can render it properly, e.g. Uniscribe renders complex scripts on Windows without checking special boxes or installing language packs, mobile phones work correctly etc. I did quick-n-dirty basic analysis with wget and regular expressions of a sample of common government/news sites, and confirmed text generally has correct unicode: we can tone it down. As a safe step, first remove too-aggressive rules (e.g. that conflate different consonants), these cause more harm than good for "good text".
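Even a spot check as small as this would do (sketched in Java rather than wget + regexes, just to stay in this thread's language; pass whichever site you want to sample as the argument):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Quick-n-dirty check: fetch a page and report how much of its letter content
// falls in the Telugu Unicode block (U+0C00-U+0C7F). "Good text" from modern
// sites should show real Telugu code points rather than legacy 8-bit font soup.
public class TeluguUnicodeSpotCheck {
  public static void main(String[] args) throws Exception {
    String url = args[0];
    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
    String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

    long teluguCodePoints = body.codePoints().filter(cp -> cp >= 0x0C00 && cp <= 0x0C7F).count();
    long letterCodePoints = body.codePoints().filter(Character::isLetter).count();

    System.out.printf("%s: %d Telugu code points out of %d letters (%.1f%%)%n",
        url, teluguCodePoints, letterCodePoints,
        letterCodePoints == 0 ? 0.0 : 100.0 * teluguCodePoints / letterCodePoints);
  }
}
```

If a page that is clearly full of Telugu prose shows mostly ASCII letters instead of Telugu-block code points, that's the legacy-font/janky-conversion case the filter was built for.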
@rmuir,
You're absolutely right; I should have led with this data in my initial comment.
Here's a direct look at the state of modern Telugu content, which strongly suggests that the issues the IndicNormalizationFilter was designed to address are less prevalent now:
- Prevalence of Clean Unicode Text: I've analyzed several high-volume, real-world Telugu sources, and the trend towards clean Unicode is very clear across these examples:
- The official website of the Government of Telangana: https://www.telangana.gov.in/te/
- The Andhra Pradesh Government's Irrigation Department website: https://irrigationap.cgg.gov.in/wrd/home
- The Andhra Pradesh Agriculture Department website: https://www.apagrisnet.gov.in/
- Eenadu, a major Telugu news publication: https://www.eenadu.net/ (consistently a top-three paper by circulation).
All content on these sites consistently uses UTF-8 Unicode. Characters like వు (vu) and మ (ma) are rendered distinctly and unambiguously.
- Widespread OS-Level Font Support: The need for "custom fonts from websites" or "janky conversion" is largely gone because popular OS vendors have been bundling robust Telugu font support for over two decades:
- Windows: Gautami has been included since 2001 (https://en.wikipedia.org/wiki/Gautami_(typeface)), and Nirmala UI, a comprehensive typeface for Indic scripts, has been bundled since Windows 8 (https://en.wikipedia.org/wiki/Nirmala_UI).
- macOS: macOS Monterey alone includes 15 Telugu fonts (Apple support page: https://support.apple.com/en-in/103203).
This widespread, native OS support means that users generally aren't dealing with systems that require special handling or that struggle with complex-script rendering for modern Unicode Telugu text.
The core issue is that applying the వు to మ conflation by default now introduces a linguistically incorrect loss of precision for the vast majority of current Telugu content. Given that, I want to reiterate the two options I proposed earlier:
Option 1: Fix the Default (My Preference). I'd propose adding a boolean option to the TeluguAnalyzer constructor that controls whether IndicNormalizationFilter is included, and making its default false. This would make TeluguAnalyzer precise right out of the box for modern documents, while users with older, less well-formatted text could still explicitly enable it. I believe this is a necessary correction for linguistic accuracy, and we should explicitly document the conversion either way.
Option 2: Document the behavior in TeluguAnalyzer. Alternatively, we could document this specific behavior in the TeluguAnalyzer docs, explaining the వు to మ mapping and how to build a custom analyzer to avoid it.
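If we go with Option 2, the docs could include something like this sketch of a chain assembled without IndicNormalizationFilter. It assumes the te module exposes TeluguNormalizationFilterFactory and TeluguStemFilterFactory the way the other language modules do, and it skips stop words, so treat it as a starting point rather than an exact replica of the stock chain:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.apache.lucene.analysis.te.TeluguNormalizationFilterFactory;
import org.apache.lucene.analysis.te.TeluguStemFilterFactory;

// A Telugu chain assembled without IndicNormalizationFilter, so the script-wide
// foldings (including వు -> మ) are never applied.
public class NoIndicFoldingTelugu {
  public static void main(String[] args) throws Exception {
    try (Analyzer custom = CustomAnalyzer.builder()
        .withTokenizer(StandardTokenizerFactory.class)
        .addTokenFilter(LowerCaseFilterFactory.class)
        .addTokenFilter(TeluguNormalizationFilterFactory.class) // Telugu-specific folding only
        .addTokenFilter(TeluguStemFilterFactory.class)
        .build()) {
      System.out.println("Custom Telugu analyzer built without indic_normalization: " + custom);
    }
  }
}
```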
Option 1 feels like the right long-term fix for the default user experience, given the current state of Telugu content. What do you think? I can raise a PR once we agree on an approach.