ECMA 262: \d should only match ASCII digits

Open fdutton opened this issue 2 years ago • 6 comments

Given this pattern ^\d$

This should match: 0

And this should not: ߀ (U+07C0, NKO DIGIT ZERO)
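To make the expected behaviour concrete, here is a minimal sketch of the ASCII-only versus Unicode-aware distinction. It uses java.util.regex from the JDK rather than joni, purely as an illustration; the class and method names (DigitClassDemo, asciiDigit, unicodeDigit) are made up for this example.

```java
import java.util.regex.Pattern;

public class DigitClassDemo {
    // java.util.regex treats \d as [0-9] by default (ASCII only).
    static final Pattern ASCII_DIGIT = Pattern.compile("^\\d$");

    // With UNICODE_CHARACTER_CLASS, \d matches any decimal digit (category Nd).
    static final Pattern UNICODE_DIGIT =
            Pattern.compile("^\\d$", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean asciiDigit(String s) {
        return ASCII_DIGIT.matcher(s).find();
    }

    static boolean unicodeDigit(String s) {
        return UNICODE_DIGIT.matcher(s).find();
    }

    public static void main(String[] args) {
        String nkoZero = "\u07C0"; // NKO DIGIT ZERO, a non-ASCII decimal digit

        System.out.println("ASCII \\d matches 0:        " + asciiDigit("0"));
        System.out.println("ASCII \\d matches U+07C0:   " + asciiDigit(nkoZero));
        System.out.println("Unicode \\d matches 0:      " + unicodeDigit("0"));
        System.out.println("Unicode \\d matches U+07C0: " + unicodeDigit(nkoZero));
    }
}
```

The ASCII-only pattern corresponds to the behaviour requested in this issue; the Unicode-aware one corresponds to the behaviour being reported.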

fdutton avatar Apr 30 '23 21:04 fdutton

@fdutton on JRuby we behave as you describe: with our encodings, the pattern does not match ߀ but does match 0. I am guessing you are using joni as a Java library, so perhaps something in the configuration or in the way it is called makes it behave differently?

With any extra info we can try to figure out why it works for us and, if it really does work, how we get that result.

enebo avatar May 01 '23 16:05 enebo

It looks like Ruby (JRuby) explicitly restricts digits to ASCII only: https://github.com/jruby/joni/blob/master/src/org/joni/Syntax.java#L459

enebo avatar May 01 '23 16:05 enebo

I'll write some unit tests, but this is what I am doing to work around the issue.

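import java.nio.charset.StandardCharsets;
import org.jcodings.specific.UTF8Encoding;
import org.joni.Option;
import org.joni.Regex;
import org.joni.Syntax;

// Joni is too liberal on some constructs, so rewrite them to explicit classes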
String s = regex
    .replace("\\d", "[0-9]")
    .replace("\\D", "[^0-9]")
    .replace("\\w", "[a-zA-Z0-9_]")
    .replace("\\W", "[^a-zA-Z0-9_]")
    .replace("\\s", "[ \\f\\n\\r\\t\\v\\u00a0\\u1680\\u2000-\\u200a\\u2028\\u2029\\u202f\\u205f\\u3000\\ufeff]")
    .replace("\\S", "[^ \\f\\n\\r\\t\\v\\u00a0\\u1680\\u2000-\\u200a\\u2028\\u2029\\u202f\\u205f\\u3000\\ufeff]");

byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
this.pattern = new Regex(bytes, 0, bytes.length, Option.NONE, UTF8Encoding.INSTANCE, Syntax.ECMAScript);
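One caveat with chained replace calls is that they rewrite every occurrence, so `[\d-]` becomes `[[0-9]-]` and an escaped backslash followed by `d` (`\\d` in the regex) is also rewritten. Below is a rough sketch of a scanner that avoids both cases. AsciiClassRewriter, rewrite and expand are hypothetical names, not part of joni; `\s` and `\S` are left out for brevity, and negated shorthands inside a character class are passed through unchanged.

```java
public class AsciiClassRewriter {
    private static final String DIGIT = "0-9";
    private static final String WORD = "a-zA-Z0-9_";

    /** Rewrites \d, \D, \w and \W to ASCII-only classes, one escape at a time. */
    static String rewrite(String regex) {
        StringBuilder out = new StringBuilder();
        boolean inClass = false; // true while between '[' and ']'
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\' && i + 1 < regex.length()) {
                // Consume the backslash and the escaped character together,
                // so "\\\\d" (escaped backslash, then d) is never misread as \d.
                char next = regex.charAt(++i);
                out.append(expand(next, inClass));
            } else {
                if (c == '[' && !inClass) {
                    inClass = true;
                } else if (c == ']' && inClass) {
                    inClass = false;
                }
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Returns the replacement for one escape sequence. */
    static String expand(char escaped, boolean inClass) {
        switch (escaped) {
            case 'd': return inClass ? DIGIT : "[" + DIGIT + "]";
            case 'w': return inClass ? WORD : "[" + WORD + "]";
            // A negated shorthand cannot be inlined into an enclosing class,
            // so it is only rewritten at the top level.
            case 'D': return inClass ? "\\D" : "[^" + DIGIT + "]";
            case 'W': return inClass ? "\\W" : "[^" + WORD + "]";
            default:  return "\\" + escaped;
        }
    }

    public static void main(String[] args) {
        System.out.println(rewrite("^\\d$"));
        System.out.println(rewrite("[\\d-]"));
        System.out.println(rewrite("\\\\d"));
    }
}
```

The result of rewrite could then be converted to bytes and passed to the Regex constructor in the same way as the string s above.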

fdutton avatar May 01 '23 18:05 fdutton

@fdutton I don't know where the oniguruma repo is, but you could check whether the syntax for ECMAScript was updated upstream. We tend to look at the Onigmo fork used by C Ruby, but we are pretty far downstream. Perhaps there is a more up-to-date syntax?

enebo avatar May 01 '23 18:05 enebo

@enebo I think we are still on par with respect to regexp functionality. We've been tracking https://github.com/k-takata/Onigmo/graphs/contributors and there's not a lot of activity there. There have been more changes in the MRI codebase lately, though.

lopex avatar May 01 '23 19:05 lopex

There also doesn't seem to be an ECMAScript syntax in either the Onigmo or the MRI repository.

lopex avatar May 01 '23 19:05 lopex