ballerina-spec
Support `\b` and `\B` in regexp
Description:
Currently, Ballerina does not support `\b` for word boundaries. Java, Python, and JavaScript all support `\b`.
Some real-world use cases of `\b` are:
- Avoiding partial matches: When searching for specific words, it's essential to avoid matching them as part of longer words or phrases. For example, if you search for the word "cat" without word boundaries, you could unintentionally match "concatenate" or "category." By using `\b`, you can limit the match to standalone occurrences of the word.
- Finding repeated words inside a string: `r'\b(\w+)\b(?=.*\b\1\b)'` is the regex used for this task in Python.
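Both use cases can be reproduced with Python's `re` module. The following is a minimal sketch; the sample sentence and the helper names (`count_loose`, `count_strict`, `repeated_words`) are made up for illustration and are not part of the issue.

```python
import re

# Illustrative sample text (an assumption, not taken from the issue).
TEXT = "The cat sat near the category index; the cat did not concatenate."

def count_loose(word: str, text: str) -> int:
    # No word boundaries: also counts hits inside longer words.
    return len(re.findall(re.escape(word), text))

def count_strict(word: str, text: str) -> int:
    # \b on both sides: only standalone occurrences count.
    return len(re.findall(r"\b" + re.escape(word) + r"\b", text))

def repeated_words(text: str) -> list[str]:
    # Pattern from the issue: a whole word that appears again,
    # as a whole word, somewhere later in the string (case-sensitive).
    return re.findall(r"\b(\w+)\b(?=.*\b\1\b)", text)

if __name__ == "__main__":
    print(count_loose("cat", TEXT))   # hits inside "category" and "concatenate" too
    print(count_strict("cat", TEXT))  # standalone "cat" only
    print(repeated_words(TEXT))
```

Without `\b`, the second task has no direct equivalent: the backreference `\1` would also match inside longer words.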
Code sample that shows issue:
Related Issues: https://github.com/ballerina-platform/ballerina-lang/issues/40392
We'll have to consider supporting both `\b` and `\B` assertions.
Ref: https://262.ecma-international.org/#prod-Assertion
If we do `\b`, we should certainly do `\B`.
This isn't in the JSON schema interoperable subset: https://datatracker.ietf.org/doc/html/draft-bhutton-json-schema-01#name-regular-expressions. Nor is it in I-Regexp https://datatracker.ietf.org/doc/draft-ietf-jsonpath-iregexp/ (which is what JSON schema is heading to https://github.com/orgs/json-schema-org/discussions/136).
The semantics of `\b` and `\B` can be defined in terms of `\w` and `\W` (i.e. a word boundary is a point where the character on one side matches `\w` and the character on the other side doesn't), which we have. So it makes sense to include this, particularly if we are doing lookahead assertions #1241.
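The definition in terms of `\w` can be sketched in Python as follows. This is only an illustration of the idea, not proposed spec text: the function names are hypothetical, `\w` is restricted to ASCII for simplicity, and the start and end of the input are assumed to count as non-word positions (which is how Python and ECMAScript treat them).

```python
import re

def is_word_char(ch: str) -> bool:
    # Stand-in for \w, ASCII-only here (an assumption for this sketch).
    return re.fullmatch(r"\w", ch, flags=re.ASCII) is not None

def is_word_boundary(s: str, i: int) -> bool:
    # Position i lies between s[i-1] and s[i]; positions outside the
    # string are treated as non-word characters.
    before = i > 0 and is_word_char(s[i - 1])
    after = i < len(s) and is_word_char(s[i])
    return before != after  # exactly one side is a word character

def boundaries(s: str) -> list[int]:
    # All positions where the hand-written definition reports a boundary.
    return [i for i in range(len(s) + 1) if is_word_boundary(s, i)]

def boundaries_re(s: str) -> list[int]:
    # The same positions as reported by the regex engine's own \b.
    return [m.start() for m in re.finditer(r"\b", s, flags=re.ASCII)]

if __name__ == "__main__":
    sample = "a cat!"
    print(boundaries(sample))
    print(boundaries_re(sample))
```

Under this definition, `\B` is simply the negation: it holds at every position where `is_word_boundary` returns false.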