
`maxLength` constraint checking seems not to count (Hindi) Unicode characters properly

Open kelson42 opened this issue 1 year ago • 5 comments

What version of Ajv are you using? Does the issue happen if you use the latest version?

8.13.0

Ajv options object

ajv = new Ajv({});

JSON Schema

var schema = {
    type: 'string',
    maxLength: 30,
};

Sample data

var data = "विकी मेड मेडिकल इनसाइक्लोपीडिया हिंदी में";

Your code

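// Compile the schema above into a validation function, then run it.
const validate = ajv.compile(schema);
console.log(validate(data));
console.log(validate.errors);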

Validation result, data AFTER validation, error messages

false
[
  {
    instancePath: '',
    schemaPath: '#/maxLength',
    keyword: 'maxLength',
    params: { limit: 30 },
    message: 'must NOT have more than 30 characters'
  }
]

What results did you expect?

You can count for yourself: the string has only 25 characters, so it should pass.

Are you going to resolve the issue?

No

kelson42 • Apr 30 '24 20:04

Hi there and thanks for reaching out.

'विकी मेड मेडिकल इनसाइक्लोपीडिया हिंदी में'.length reports as 41, which I think is due to the multibyte characters required to write Hindi. I don't think that this is something that should be supported by the AJV core library, but given the extensibility of AJV you could write your own keywords to handle this text the way you think it should be handled.

edit: actually, it is not about Unicode surrogate pairs (which are counted as a single character by AJV); it seems to be related to how multiple characters, particularly accents, are grouped together in Hindi.

For example, look at the result of 'विकी मेड मेडिकल इनसाइक्लोपीडिया हिंदी में'.split('')

(41) ['व', 'ि', 'क', 'ी', ' ', 'म', 'े', 'ड', ' ', 
'म', 'े', 'ड', 'ि', 'क', 'ल', ' ', 'इ', 'न', 'स', 'ा', 
'इ', 'क', '्', 'ल', 'ो', 'प', 'ी', 'ड', 'ि', 'य', 
'ा', ' ', 'ह', 'ि', 'ं', 'द', 'ी', ' ', 'म', 'े', 'ं']
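
To make the distinction concrete, here is a small sketch in plain JavaScript (no AJV involved) that compares UTF-16 code units, code points, and combining marks for the sample string. The variable names are only illustrative. `\p{M}` is the Unicode "Mark" category, which covers the Devanagari vowel signs, the virama and the anusvara that appear as separate entries in the array above.

```javascript
const text = 'विकी मेड मेडिकल इनसाइक्लोपीडिया हिंदी में';

// UTF-16 code units: what String.prototype.length reports.
const codeUnits = text.length;

// Code points: the string iterator joins surrogate pairs into one item.
const codePoints = [...text].length;

// Code points in the Unicode "Mark" category (vowel signs, virama, anusvara).
const marks = [...text].filter((ch) => /\p{M}/u.test(ch)).length;

console.log({ codeUnits, codePoints, marks, baseCharacters: codePoints - marks });
```

Devanagari sits entirely in the Basic Multilingual Plane, so the code-unit and code-point counts should be identical here (41). The gap to the 25 characters a reader counts by eye comes from the marks that attach to a preceding letter.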

jasoniangreen • Apr 30 '24 21:04

There is a `unicode` option (deprecated, probably) that determines how length is computed.

epoberezkin • May 01 '24 08:05

https://github.com/ajv-validator/ajv/blob/master/lib/vocabularies/validation/limitLength.ts#L25

epoberezkin • May 01 '24 08:05

it's on by default (it does not use length), and if it's not working correctly, it needs fixing

https://github.com/ajv-validator/ajv/blob/master/lib/runtime/ucs2length.ts
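
For readers who do not follow the link, a length function that is aware of surrogate pairs can be sketched roughly as below. This is an illustrative reimplementation written for this thread, on the assumption that the runtime helper counts code points rather than UTF-16 code units; `countCodePoints` is a hypothetical name, not AJV's code.

```javascript
// Count code points, treating a high+low surrogate pair as one character.
function countCodePoints(str) {
  let count = 0;
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    // High surrogate followed by a low surrogate: skip the second half.
    if (unit >= 0xd800 && unit <= 0xdbff && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) i++;
    }
    count++;
  }
  return count;
}

console.log('😀'.length, countCodePoints('😀')); // 2 1
console.log(countCodePoints('वि')); // 2
```

A function like this handles emoji and other characters outside the Basic Multilingual Plane, but it still counts a consonant and its vowel sign as two, which is why it does not help with the Hindi sample.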

epoberezkin • May 01 '24 08:05

Ok, I will have a look

jasoniangreen • May 01 '24 21:05

Hi @kelson42, after discussing with EP we have decided that this is not something that we will be fixing within the core AJV library.

This problem is due to the multi-glyph characters found in the Devanagari script and no doubt in many other scripts. A single character like वि is actually made up of multiple characters, व and ि (notice the dotted circle that shows how this character attaches to others). These are called grapheme clusters.

From just inspecting the characters there is no metadata that will tell us which chars are part of a grapheme cluster and should therefore be counted as 1. For this reason we cannot put this logic into AJV as it would require a lot of bespoke code to cover all multi-glyph charsets.

This doesn't stop you from solving this problem yourself using custom keywords, you could even publish the solution for others, but it doesn't belong in the AJV code base.
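
One possible shape for such a custom keyword is sketched below, assuming a JavaScript runtime that ships `Intl.Segmenter`. The keyword name `maxGraphemes` and the helper `countGraphemes` are made up for this example and are not part of AJV. The definition object is kept separate from any AJV instance so the snippet runs on its own; the exact grapheme count can vary slightly with the Unicode data bundled in the runtime.

```javascript
// Count user-perceived characters (grapheme clusters) with the built-in segmenter.
function countGraphemes(str, locale = 'hi') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  let count = 0;
  for (const _segment of segmenter.segment(str)) count++;
  return count;
}

// Hypothetical keyword definition; "maxGraphemes" is an invented name.
const maxGraphemes = {
  keyword: 'maxGraphemes',
  type: 'string',
  schemaType: 'number',
  // AJV calls validate(schemaValue, data) for keywords defined this way.
  validate: (limit, data) => countGraphemes(data) <= limit,
};

// With AJV installed, this would be registered via ajv.addKeyword(maxGraphemes)
// and used in a schema as { type: 'string', maxGraphemes: 30 }.
const sample = 'विकी मेड मेडिकल इनसाइक्लोपीडिया हिंदी में';
console.log(countGraphemes(sample), maxGraphemes.validate(30, sample));
```

Using a separate keyword rather than changing `maxLength` keeps the standard keyword's behaviour unchanged for schemas that rely on it.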

I will however document this issue and I thank you again for bringing it to our attention.

edit: to add a link to the spec on grapheme clusters

jasoniangreen • May 07 '24 20:05

@jasoniangreen @epoberezkin Thank you for considering my issue and for your advice. For the record, here is how I have fixed the problem.

kelson42 • May 09 '24 13:05