ajv
`maxLength` constraint checking does not seem to count (Hindi) Unicode characters properly
What version of Ajv are you using? Does the issue happen if you use the latest version?
8.13.0
Ajv options object
ajv = new Ajv({});
JSON Schema
var schema = {
type: 'string',
maxLength: 30,
};
Sample data
var data = "विकी मेड मेडिकल इनसाइक्लोपीडिया हिंदी में";
Your code
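// Compile the schema into a validation function before calling it
var validate = ajv.compile(schema);
console.log(validate(data));
console.log(validate.errors);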
Validation result, data AFTER validation, error messages
false
[
{
instancePath: '',
schemaPath: '#/maxLength',
keyword: 'maxLength',
params: { limit: 30 },
message: 'must NOT have more than 30 characters'
}
]
What results did you expect?
You can count for yourself: the string has only 25 characters, so it should pass.
Are you going to resolve the issue?
No
Hi there and thanks for reaching out.
`'विकी मेड मेडिकल इनसाइक्लोपीडिया हिंदी में'.length` reports 41, which I think is due to the multibyte characters required to write Hindi. I don't think this is something that should be supported by the AJV core library, but given the extensibility of AJV you could write your own keyword to handle this text the way you think it should be handled.
edit: actually, it is not about Unicode surrogate pairs (which AJV counts as a single character); it seems to be related to how multiple characters, particularly accents, are grouped together in Hindi.
For example, look at the result of 'विकी मेड मेडिकल इनसाइक्लोपीडिया हिंदी में'.split('')
(41) ['व', 'ि', 'क', 'ी', ' ', 'म', 'े', 'ड', ' ',
'म', 'े', 'ड', 'ि', 'क', 'ल', ' ', 'इ', 'न', 'स', 'ा',
'इ', 'क', '्', 'ल', 'ो', 'प', 'ी', 'ड', 'ि', 'य',
'ा', ' ', 'ह', 'ि', 'ं', 'द', 'ी', ' ', 'म', 'े', 'ं']
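To make the split output above easier to read, here is a minimal sketch that lists the code points of the single syllable `वि` and compares `.length` with a grapheme count. It assumes a runtime that provides `Intl.Segmenter` (recent Node.js versions and browsers do); `countGraphemes` is a hypothetical helper name and not part of AJV.

```javascript
// Hypothetical helper: count user-perceived characters (grapheme clusters).
// Assumes Intl.Segmenter is available in the runtime.
function countGraphemes(str) {
  const segmenter = new Intl.Segmenter("hi", { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(str)) count++;
  return count;
}

const syllable = "वि"; // consonant व followed by the vowel sign ि

// List the code points that make up the syllable.
const codePoints = [...syllable].map(
  (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
);

console.log(codePoints);               // the individual code points
console.log(syllable.length);          // number of UTF-16 code units
console.log(countGraphemes(syllable)); // number of grapheme clusters
```

The syllable is stored as two code points but is segmented as one grapheme cluster, which is the gap between what `.split('')` shows and what a reader counts.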
There is a `unicode` option (probably deprecated) that determines how length is computed.
https://github.com/ajv-validator/ajv/blob/master/lib/vocabularies/validation/limitLength.ts#L25
It's on by default (so it does not use `.length`), and if it's not working correctly, it needs fixing.
https://github.com/ajv-validator/ajv/blob/master/lib/runtime/ucs2length.ts
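For readers who do not want to follow the link, the sketch below illustrates what surrogate-aware length counting generally looks like. It is an illustration written for this thread, not a copy of the linked file, and `countCodePoints` is a hypothetical name. It shows why this kind of counting handles emoji-style surrogate pairs yet still reports 41 for the Hindi sample: every Devanagari character in it is a separate code point.

```javascript
// Sketch of surrogate-aware length counting (hypothetical name, not AJV's code).
function countCodePoints(str) {
  let count = 0;
  let pos = 0;
  while (pos < str.length) {
    count++;
    const unit = str.charCodeAt(pos++);
    // A high surrogate followed by a low surrogate forms a single code point.
    if (unit >= 0xd800 && unit <= 0xdbff && pos < str.length) {
      const next = str.charCodeAt(pos);
      if (next >= 0xdc00 && next <= 0xdfff) pos++;
    }
  }
  return count;
}

const emoji = "\u{1F600}"; // one code point stored as a surrogate pair
const hindi = "विकी मेड मेडिकल इनसाइक्लोपीडिया हिंदी में";

console.log(emoji.length, countCodePoints(emoji)); // code units vs code points
console.log(hindi.length, countCodePoints(hindi)); // no surrogate pairs here
```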
Ok, I will have a look
Hi @kelson42 after discussing with EP we have decided that this is not something that we will be fixing within the core AJV library.
This problem is due to the multi-glyph characters that make up the Devanagari script and no doubt many other scripts. A single character like वि is actually made up of multiple characters, व and ि (notice the dotted circle that shows how this character attaches to others). These are called grapheme clusters.
From just inspecting the characters there is no metadata that will tell us which chars are part of a grapheme cluster and should therefore be counted as 1. For this reason we cannot put this logic into AJV as it would require a lot of bespoke code to cover all multi-glyph charsets.
This doesn't stop you from solving this problem yourself using custom keywords, you could even publish the solution for others, but it doesn't belong in the AJV code base.
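As one possible shape for such a custom keyword, here is a hedged sketch of a keyword definition object that counts grapheme clusters with `Intl.Segmenter`. The keyword name `maxGraphemes` and the helper `countGraphemes` are hypothetical, the approach assumes `Intl.Segmenter` is available in the runtime, and it is not necessarily what anyone in this thread ended up using. The exact cluster count for a given string can also vary with the Unicode data shipped in the runtime.

```javascript
// Hypothetical keyword "maxGraphemes"; not part of AJV.
// Assumes Intl.Segmenter is available in the runtime.
const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Count user-perceived characters instead of code units or code points.
function countGraphemes(str) {
  let count = 0;
  for (const _segment of segmenter.segment(str)) count++;
  return count;
}

// Keyword definition in the object form accepted by ajv.addKeyword in Ajv v8.
const maxGraphemesKeyword = {
  keyword: "maxGraphemes",
  type: "string",
  schemaType: "number",
  // Called with the schema value (the limit) and the data being validated.
  validate: (limit, data) => countGraphemes(data) <= limit,
};

// Registration, assuming an existing Ajv v8 instance named `ajv`:
//   ajv.addKeyword(maxGraphemesKeyword);
//   const validate = ajv.compile({ type: "string", maxGraphemes: 30 });

const data = "विकी मेड मेडिकल इनसाइक्लोपीडिया हिंदी में";
console.log(maxGraphemesKeyword.validate(30, data));
```

Keeping this as a separate keyword leaves the standard `maxLength` behaviour untouched, so schemas shared with other validators still mean the same thing.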
I will however document this issue and I thank you again for bringing it to our attention.
edit: to add a link to the spec on grapheme clusters
@jasoniangreen @epoberezkin Thank you for considering my issue and for your advice. For the record, here is how I fixed the problem.