joni ECMA 262: \d should only match ASCII digits

Given this pattern ^\d$

This should match: 0

And this should not: ߀
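
For comparison, the two behaviors can be reproduced with `java.util.regex` alone. This is only a sketch of the distinction and does not use joni. It assumes the character above is U+07C0 (NKO DIGIT ZERO); the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

public class DigitClassDemo {
    // Assumption: the non-ASCII character in the report is U+07C0 (NKO DIGIT ZERO).
    static final String NKO_ZERO = new String(Character.toChars(0x07C0));

    // Default java.util.regex behavior: \d is exactly [0-9], as ECMA-262 requires.
    static boolean asciiDigit(String s) {
        return Pattern.compile("^\\d$").matcher(s).matches();
    }

    // With UNICODE_CHARACTER_CLASS, \d matches any Unicode decimal digit.
    static boolean unicodeDigit(String s) {
        return Pattern.compile("^\\d$", Pattern.UNICODE_CHARACTER_CLASS)
                .matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println("ASCII \\d, '0':        " + asciiDigit("0"));
        System.out.println("ASCII \\d, U+07C0:     " + asciiDigit(NKO_ZERO));
        System.out.println("Unicode \\d, U+07C0:   " + unicodeDigit(NKO_ZERO));
    }
}
```

The ASCII-only variant is the behavior I expect from `Syntax.ECMAScript`; the Unicode variant corresponds to what I am seeing.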

Apr 30 '23 21:04 fdutton

@fdutton On JRuby we behave as you describe: with our encodings the pattern does not match ߀ but does match 0. I am guessing you are using joni as a Java library, so perhaps something in how it is configured or called produces the behavior you are seeing?

With any extra info we can try to figure out why it works for us and, if it really is working, how we get that result.

May 01 '23 16:05 enebo

It looks like the Ruby (JRuby) syntax explicitly restricts numerics to ASCII only: https://github.com/jruby/joni/blob/master/src/org/joni/Syntax.java#L459

May 01 '23 16:05 enebo

I'll write some unit tests, but this is what I am doing to work around the issue in the meantime.

// Joni is too liberal on some constructs
String s = regex
    .replace("\\d", "[0-9]")
    .replace("\\D", "[^0-9]")
    .replace("\\w", "[a-zA-Z0-9_]")
    .replace("\\W", "[^a-zA-Z0-9_]")
    .replace("\\s", "[ \\f\\n\\r\\t\\v\\u00a0\\u1680\\u2000-\\u200a\\u2028\\u2029\\u202f\\u205f\\u3000\\ufeff]")
    .replace("\\S", "[^ \\f\\n\\r\\t\\v\\u00a0\\u1680\\u2000-\\u200a\\u2028\\u2029\\u202f\\u205f\\u3000\\ufeff]");

byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
this.pattern = new Regex(bytes, 0, bytes.length, Option.NONE, UTF8Encoding.INSTANCE, Syntax.ECMAScript);
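
One caveat with plain `String.replace`: it also rewrites `\d` when the backslash is itself escaped (the pattern `\\d`, a literal backslash followed by `d`), and it produces nested brackets for something like `[\d.]`. A minimal sketch of a scanner that avoids both cases, assuming ECMAScript-style character classes where `]` always closes the class. `AsciiShorthand` and `rewrite` are hypothetical names, and `\s`/`\S` are left out for brevity.

```java
import java.util.Map;

public class AsciiShorthand {
    // Bodies of the ASCII-only classes, without the surrounding brackets.
    private static final Map<Character, String> BODY = Map.of(
            'd', "0-9",
            'w', "a-zA-Z0-9_");

    /**
     * Rewrites \d, \D, \w, \W outside character classes, and \d, \w inside them.
     * Every other escape, including an escaped backslash, is copied unchanged.
     */
    public static String rewrite(String regex) {
        StringBuilder out = new StringBuilder();
        boolean inClass = false;
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\' && i + 1 < regex.length()) {
                // Consume the escape as a pair so that "\\d" is never split.
                char next = regex.charAt(++i);
                String body = BODY.get(Character.toLowerCase(next));
                boolean negated = Character.isUpperCase(next);
                if (body == null || (inClass && negated)) {
                    // Unknown escape, or a negated shorthand inside [...]
                    // (not expressible by simple substitution): keep as-is.
                    out.append('\\').append(next);
                } else if (inClass) {
                    out.append(body);
                } else {
                    out.append(negated ? "[^" : "[").append(body).append(']');
                }
            } else {
                if (c == '[') {
                    inClass = true;
                } else if (c == ']') {
                    inClass = false;
                }
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this, `rewrite("^\\d$")` yields `^[0-9]$`, while the pattern `\\d` passes through untouched. Negated shorthands inside a class (for example `[\D]`) are deliberately not rewritten and would still need separate handling.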

May 01 '23 18:05 fdutton

@fdutton I don't know where the oniguruma repo is, but you could check whether the syntax for ECMAScript was updated upstream. We tend to look at the Onigmo fork used by C Ruby, but we are pretty far downstream. Perhaps there is a more up-to-date syntax?

May 01 '23 18:05 enebo

@enebo I think we are still on par wrt regexp functionality. We've been tracking https://github.com/k-takata/Onigmo/graphs/contributors and there's not a lot of activity there. There have been more changes in the MRI codebase lately, though.

May 01 '23 19:05 lopex

There also doesn't seem to be an ECMA syntax in either the Onigmo or the MRI repository.

May 01 '23 19:05 lopex