
Proposal: Support for Tamil phonetic hashing using `tmphone`

Open Bowrna opened this issue 8 months ago • 4 comments

Hi @knadh,

Thank you for building and maintaining dictpress.

I’ve been working on tmphone, a Tamil phonetic hashing library inspired by knphone and mlphone. It focuses on generating consistent phonetic hashes for Tamil text, which is useful for fuzzy matching and search.

I noticed that dictpress currently supports Kannada (alar) and Malayalam (olam) dictionaries with their respective phonetic hashers, but there’s no Tamil support yet. I'm not proposing a Tamil dictionary right now, but I’d like to propose tmphone as a potential hashing backend for future Tamil needs.

I'd like to hear your thoughts on integrating Tamil phonetic hashing using tmphone into dictpress.

Thanks, Bowrna

Bowrna avatar May 15 '25 06:05 Bowrna

That's really cool @Bowrna. Once you've finished tmphone, we can integrate it here https://github.com/knadh/dictpress/blob/master/tokenizers/indicphone/indicphone.go.

The original MLphone algorithm works for almost all major Indic languages. It's currently implemented for Malayalam and Kannada. I've been meaning to formalize it into a single family named IndicPhone (for over a decade now, phew). Will get around to doing it at some point.

knadh avatar May 16 '25 05:05 knadh

Thanks for the update @knadh. I'll raise a PR to integrate it once I've finished. Yes, in a way, it could grow to support the major Indic languages. Some things I observed while working on this:

  1. The regex for regexKey1 in both Malayalam and Kannada ignores 3. Looking into it, I found that this is the anusvaram. (I don't read, write, or speak Malayalam or Kannada, though I can do all three in Tamil, so please correct me if I've got this wrong.) Tamil has no concept of anusvaram. We use the pulli character heavily; it is a silencer that removes the vowel sound from a consonant. So the regex may differ here.
  2. I see modVowels used in Kannada and Malayalam. However, a vowel followed by a modifier is invalid in Tamil. I'm not sure how this works in the other two languages. If there is a reason for including it, let me know.
  3. No special stop words like the chillus in Malayalam are used in Tamil.
  4. Compounds in Tamil are written as two distinct letters, while both Malayalam and Kannada support them as a single cluster.
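
To make points 1 and 4 concrete, here is a minimal Go sketch of how the pulli shows up at the code point level. It is not how tmphone, mlphone, or knphone work; `classify` and its labels are hypothetical names made up for illustration. The ranges are taken from the Unicode Tamil block and are deliberately loose (they include a few unassigned code points).

```go
package main

import "fmt"

// pulli is U+0BCD (TAMIL SIGN VIRAMA). It removes the inherent vowel
// of the consonant it follows.
const pulli = '\u0BCD'

// Loose range checks over the Unicode Tamil block. These ranges contain
// some unassigned code points; that is acceptable for a sketch.
func isVowel(r rune) bool     { return r >= 0x0B85 && r <= 0x0B94 }
func isConsonant(r rune) bool { return r >= 0x0B95 && r <= 0x0BB9 }
func isVowelSign(r rune) bool { return r >= 0x0BBE && r <= 0x0BCC }

// classify (hypothetical helper) labels each letter unit in s.
// Anything that is not a Tamil vowel or consonant is skipped.
func classify(s string) []string {
	runes := []rune(s)
	var out []string
	for i := 0; i < len(runes); i++ {
		r := runes[i]
		switch {
		case isVowel(r):
			out = append(out, "vowel")
		case isConsonant(r):
			if i+1 < len(runes) && runes[i+1] == pulli {
				out = append(out, "consonant+pulli") // pure consonant
				i++                                  // consume the pulli
			} else if i+1 < len(runes) && isVowelSign(runes[i+1]) {
				out = append(out, "consonant+vowel-sign")
				i++ // consume the vowel sign
			} else {
				out = append(out, "consonant+a") // inherent vowel kept
			}
		}
	}
	return out
}

func main() {
	// "amma": U+0B85 U+0BAE U+0BCD U+0BAE U+0BBE
	fmt.Println(classify("\u0B85\u0BAE\u0BCD\u0BAE\u0BBE"))
	// "pakkam": U+0BAA U+0B95 U+0BCD U+0B95 U+0BAE U+0BCD
	fmt.Println(classify("\u0BAA\u0B95\u0BCD\u0B95\u0BAE\u0BCD"))
}
```

In the second word, the doubled consonant appears as two separate letters, the first one silenced by a pulli, rather than as a single fused cluster.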

Bowrna avatar May 16 '25 18:05 Bowrna

Yes, that's correct. Those properties will vary from language to language, but the crux remains the same: consistent phonetic hashes for glyphs (as is the nature of Indic languages, with their strong phonetic properties).

knadh avatar May 17 '25 06:05 knadh

@knadh could I try my hand at integrating all the existing pieces together?

Bowrna avatar May 30 '25 16:05 Bowrna