Feature: string normalization

Open erfan-rfmhr opened this issue 9 months ago • 1 comments

I recommend adding a full normalization feature, particularly for Arabic-to-Persian conversion. This could be achieved by introducing a method that:

Fixes text encoding issues using ftfy.
Converts Arabic characters to their Persian equivalents.
Normalizes numbers.
Replaces diacritic marks such as ُ ِ ّ أ via an additional dedicated method.

This would reduce the amount of code needed in a project by letting the library handle normalization.

I'm happy to implement this feature if you think it would be useful.

Looking forward to your feedback.

Apr 05 '25 12:04 erfan-rfmhr

@erfan-rfmhr Jan, thanks for the thoughtful suggestion!

You're right that full normalization can be useful in many cases. One of the goals of this lib is to stay lightweight and dependency-free. So unless there's a compelling case, I’d prefer not to add external deps like ftfy. But if we can implement the same thing simply in the existing style, I’m all for it.

This library already supports a good chunk of normalization, particularly around Arabic-to-Persian character and number conversion: convert_ar_characters() handles common Arabic-to-Persian character replacements, and convert_ar_numbers() converts Arabic digits.
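For illustration, here is a minimal, dependency-free sketch of this kind of replacement using str.translate. The table contents and the name ar_to_fa_sketch are assumptions made for the example; this is not the library's actual mapping or implementation, and it assumes the target for digits is the Persian digit block.

```python
# Hypothetical mapping for illustration only; a real table would cover more letters.
AR_TO_FA_CHARS = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH  -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF  -> ARABIC LETTER KEHEH
})

# Arabic-Indic digits U+0660..U+0669 -> Persian digits U+06F0..U+06F9.
AR_TO_FA_DIGITS = {0x0660 + i: 0x06F0 + i for i in range(10)}


def ar_to_fa_sketch(text: str) -> str:
    """Replace Arabic letter and digit variants with Persian ones."""
    return text.translate(AR_TO_FA_CHARS).translate(AR_TO_FA_DIGITS)
```

Both tables are plain dictionaries, so extending the mapping means adding entries rather than changing logic.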

The part I'm unsure about is diacritics. If you can show a concrete use case for removing them (e.g. search or comparison), I’d be happy to consider a remove_diacritics() function that strips them. Would be cool to keep that logic dependency-free too.
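As a rough sketch of what a dependency-free version could look like, the standard library's unicodedata module is enough to drop combining marks such as fatha, damma, kasra and shadda. The name remove_diacritics_sketch is hypothetical. The sketch assumes that stripping every character in Unicode category Mn is acceptable, which also affects combining marks in other scripts. It leaves precomposed letters such as أ untouched, because decomposing them first would also turn آ into ا.

```python
import unicodedata


def remove_diacritics_sketch(text: str) -> str:
    """Drop combining marks (Unicode category Mn), e.g. Arabic harakat."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# Search/comparison use case: a vocalized and an unvocalized spelling
# of the same word compare equal once the marks are stripped.
vocalized = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"  # meem, damma, hah, fatha, meem, shadda, fatha, dal
plain = "\u0645\u062D\u0645\u062F"                              # meem, hah, meem, dal
same_word = remove_diacritics_sketch(vocalized) == plain
```

The zero-width non-joiner (U+200C) is category Cf rather than Mn, so this filter does not remove it.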

Next steps

If you're still up for implementing a normalization helper that wraps existing functions plus adds optional diacritic removal (and maybe encoding fix if done without deps), that would be great. Let's keep it modular and lightweight.
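One possible shape for such a helper, sketched under assumptions: normalize_sketch and the Step alias are hypothetical names, and the steps shown are stand-ins rather than the library's functions. The idea is that the helper only chains callables, so each step stays independent and optional.

```python
from typing import Callable, Iterable

# A step is any function that takes a string and returns a string.
Step = Callable[[str], str]


def normalize_sketch(text: str, steps: Iterable[Step]) -> str:
    """Apply each normalization step to the text, in order."""
    for step in steps:
        text = step(text)
    return text


# Stand-in steps for illustration: trim whitespace, then map Arabic yeh to Farsi yeh.
example_steps = [str.strip, lambda s: s.replace("\u064A", "\u06CC")]
result = normalize_sketch("  \u0639\u0644\u064A ", example_steps)
```

With this design a caller could presumably pass the existing conversion functions as steps and add diacritic removal only when needed, without the helper itself growing any dependencies.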

Looking forward to your thoughts.

May 20 '25 13:05 rezkam