python-scraperlib

Metadata length validation is buggy for unicode strings

Open benoit74 opened this issue 1 year ago • 4 comments

The specification explicitly says that we must validate the number of characters (graphemes would arguably be the more precise term).

Currently scraperlib uses the len function, which counts code points rather than graphemes. Graphemes are what we want to validate because they are what readers visually perceive; code points are not.
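To make the distinction concrete, here is a small standard-library sketch (not scraperlib code) showing that len counts code points, and that Unicode normalization only hides the problem for scripts that happen to have precomposed characters. The variable names are illustrative.

```python
import unicodedata

# "é" written as a base letter plus a combining accent: one visible character.
decomposed = "e\u0301"
print(len(decomposed))  # 2: len counts code points

# NFC normalization merges this pair into the single code point U+00E9 ...
composed = unicodedata.normalize("NFC", decomposed)
print(len(composed))  # 1

# ... but Devanagari has no precomposed forms, so normalizing does not help.
syllable = "\u0935\u093f"  # "वि": consonant VA + vowel sign I, one visible character
normalized_syllable = unicodedata.normalize("NFC", syllable)
print(len(normalized_syllable))  # 2
```

This is why normalizing before calling len is not a substitute for counting graphemes.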

It looks like (according to ChatGPT, let's be honest) we could use the grapheme library. I'm not sure this is the right choice, since the library seems barely maintained and not released in a proper manner.

import grapheme

print(len("विकी मेड मेडिकल इनसाइक्लोपीडिया हिंदी में"))  # Outputs: 41 => Wrong
print(grapheme.length("विकी मेड मेडिकल इनसाइक्लोपीडिया हिंदी में"))  # Outputs: 25 => Correct

benoit74, Apr 30 '24 20:04

Another alternative (ChatGPT again), which seems a bit better maintained, is the regex library (not the standard library's re module):

import regex

text = "विकी मेड मेडिकल इनसाइक्लोपीडिया हिंदी में"
print(len(regex.findall(r'\X', text)))  # Outputs: 25 => Correct

https://pypi.org/project/regex/

benoit74, Apr 30 '24 20:04

We've used the regex approach and an ICU-based one in the past. I think it's mandatory to clarify the spec requirement first. This matches one of the hackathon topics as well!

rgaudin, Apr 30 '24 22:04

Yesterday's discussions at the hackathon were quite clear, but I can only agree that the requirement must be clarified first ^^

benoit74, May 01 '24 08:05

Then that's OK. Would be good to add clarity to the spec Wiki as well.

rgaudin, May 01 '24 09:05

Spec has been updated during the hackathon to specify that we need to count graphemes: https://wiki.openzim.org/wiki/Metadata
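As a rough illustration of what a grapheme-based check could look like, here is a minimal sketch. It is not scraperlib's implementation: the function names, the pluggable count parameter, and the limit of 30 are hypothetical, and the default counter is only a standard-library approximation (it skips combining marks) rather than a full Unicode grapheme segmentation, so it will miscount cases such as emoji joined with zero-width joiners.

```python
import unicodedata
from typing import Callable


def approx_grapheme_count(text: str) -> int:
    """Approximate count: every code point that is not a combining mark.

    Assumption: good enough for illustration only; this is not a full
    grapheme cluster segmentation.
    """
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))


def is_within_limit(
    value: str,
    max_graphemes: int,
    count: Callable[[str], int] = approx_grapheme_count,
) -> bool:
    """Return True if value has at most max_graphemes graphemes (hypothetical helper)."""
    return count(value) <= max_graphemes


word = "\u0935\u093f\u0915\u0940"  # "विकी": 4 code points, 2 visible characters
print(len(word))                    # 4
print(approx_grapheme_count(word))  # 2
print(is_within_limit(word, 30))    # True (30 is an illustrative limit)
```

A proper grapheme counter, such as grapheme.length or a regex \X based one as discussed above, could be passed as the count argument in place of the approximation.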

benoit74, Jul 02 '24 06:07