
Support `\b` and `\B` in regexp

Open SasinduDilshara opened this issue 1 year ago • 2 comments

Description: Currently, Ballerina does not support `\b` for word boundaries. Java, Python, and JavaScript all support `\b`.

Some real-world use cases of `\b` are:

  • Avoiding partial matches: When searching for specific words, it's essential to avoid matching them as part of longer words or phrases. For example, if you search for the word "cat" without word boundaries, you could unintentionally match "concatenate" or "category." By using \b, you can limit the match to standalone occurrences of the word.

  • Finding repeated words inside a string: `r'\b(\w+)\b(?=.*\b\1\b)'` is the regex used for this task in Python.
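
As a rough illustration of the two use cases above, here is a minimal Python sketch (Python being one of the languages mentioned that already supports `\b`). The helper names `find_whole_word` and `repeated_words` are made up for this example and are not part of any proposal.

```python
import re


def find_whole_word(word, text):
    """Return the start offsets where `word` occurs as a standalone word."""
    # \b on both sides rejects matches embedded in longer words.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return [m.start() for m in pattern.finditer(text)]


def repeated_words(text):
    """Return each word that occurs again later in `text` (same line)."""
    # The lookahead requires the captured word to reappear as a whole word.
    pattern = re.compile(r"\b(\w+)\b(?=.*\b\1\b)")
    return [m.group(1) for m in pattern.finditer(text)]


if __name__ == "__main__":
    text = "concatenate the cat category"
    # Without word boundaries, "cat" is found three times.
    print([m.start() for m in re.finditer("cat", text)])  # [3, 16, 20]
    # With word boundaries, only the standalone word matches.
    print(find_whole_word("cat", text))  # [16]
    print(repeated_words("the cat saw the other cat"))  # ['the', 'cat']
```

Note that in the last call "the" inside "other" does not count as a repeat, because the lookahead also uses `\b`.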

Code sample that shows issue:

Related Issues: https://github.com/ballerina-platform/ballerina-lang/issues/40392

SasinduDilshara avatar May 11 '23 05:05 SasinduDilshara

We'll have to consider supporting both \b and \B assertions. Ref: https://262.ecma-international.org/#prod-Assertion

pcnfernando avatar May 11 '23 09:05 pcnfernando

If we do \b, we should certainly do \B.

This isn't in the JSON schema interoperable subset: https://datatracker.ietf.org/doc/html/draft-bhutton-json-schema-01#name-regular-expressions. Nor is it in I-Regexp https://datatracker.ietf.org/doc/draft-ietf-jsonpath-iregexp/ (which is what JSON schema is heading to https://github.com/orgs/json-schema-org/discussions/136).

The semantics of `\b` and `\B` can be defined in terms of `\w` and `\W`, which we have (i.e. a word boundary is a point where the character on one side matches `\w` and the character on the other side doesn't). So it makes sense to include this, particularly if we are doing lookahead assertions (#1241).
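
A minimal sketch of that definition, written in Python only because it already has `\b` to compare against. The function names are hypothetical, and the sketch assumes that a position outside the string counts as "not `\w`", which is how Python and ECMAScript treat the start and end of input.

```python
import re


def is_word_char(text, i):
    """True if index i is inside `text` and the character there matches \\w."""
    return 0 <= i < len(text) and re.fullmatch(r"\w", text[i]) is not None


def word_boundaries(text):
    """Positions where \\b would match, computed from \\w alone."""
    # A boundary is a point where exactly one neighbouring side is a word character.
    return [
        i
        for i in range(len(text) + 1)
        if is_word_char(text, i - 1) != is_word_char(text, i)
    ]


def non_word_boundaries(text):
    """Positions where \\B would match: every position that is not a boundary."""
    boundaries = set(word_boundaries(text))
    return [i for i in range(len(text) + 1) if i not in boundaries]


if __name__ == "__main__":
    sample = "a cat!"
    print(word_boundaries(sample))      # [0, 1, 2, 5]
    print(non_word_boundaries(sample))  # [3, 4, 6]
    # Cross-check against Python's built-in \b.
    print([m.start() for m in re.finditer(r"\b", sample)])  # [0, 1, 2, 5]
```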

jclark avatar May 11 '23 11:05 jclark