Metadata length validation is buggy for unicode strings
The specification says that we must validate the number of characters (graphemes would be an even more correct term).
Currently scraperlib uses the built-in len function, which counts code points, not graphemes. Graphemes are what we want to validate, because they are what the user visually perceives; code points are not.
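The mismatch is easy to reproduce with the standard library alone: combining characters make len diverge from the perceived character count even in Latin script.

```python
# "é" can be written as a single code point, or as a base letter
# plus a combining accent. Both render identically.
composed = "\u00e9"     # é as one precomposed code point
decomposed = "e\u0301"  # e + U+0301 COMBINING ACUTE ACCENT

print(len(composed))    # 1
print(len(decomposed))  # 2, although the user perceives a single character
```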
Looks like (according to ChatGPT, let's be honest) we could use the grapheme library. Not sure this is the appropriate choice, since this lib seems barely maintained and is not released in a proper manner.
import grapheme
print(len("विकी मेड मेडिकल इनसाइक्लोपीडिया हिंदी में")) # Outputs: 41 => Wrong
print(grapheme.length("विकी मेड मेडिकल इनसाइक्लोपीडिया हिंदी में")) # Outputs: 25 => Correct
Another alternative (ChatGPT again), which seems a bit better maintained, is the regex library (not the re standard library):
import regex
text = "विकी मेड मेडिकल इनसाइक्लोपीडिया हिंदी में"
print(len(regex.findall(r'\X', text))) # Outputs: 25 => Correct
https://pypi.org/project/regex/
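If we go the regex route, the check could be wrapped in a small helper. This is only a sketch: grapheme_length and validate_max_length are hypothetical names, not existing scraperlib API.

```python
import regex


def grapheme_length(text: str) -> int:
    # \X matches one extended grapheme cluster (what the user perceives
    # as a single character), as defined by Unicode UAX #29.
    return len(regex.findall(r'\X', text))


def validate_max_length(text: str, limit: int) -> None:
    # Hypothetical validator: reject metadata values that exceed the
    # spec's grapheme limit.
    count = grapheme_length(text)
    if count > limit:
        raise ValueError(f"value is {count} graphemes, limit is {limit}")
```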
We've used the regex approach and an ICU-based one in the past. I think it's mandatory to clarify the spec requirement first. This matches one of the topics of the hackathon as well!
Yesterday's discussions at the Hackathon were quite clear, but I can only agree that the requirement must be clarified first ^^
Then that's OK. It would be good to add clarity to the spec wiki as well.
Spec has been updated during the hackathon to specify that we need to count graphemes: https://wiki.openzim.org/wiki/Metadata