stdlib: add support regexp match and replace for strings

Open mikedanese opened this issue 9 years ago • 22 comments

Would be easy to implement as a builtin.

https://github.com/google/re2

mikedanese avatar Feb 18 '16 04:02 mikedanese

Hard in Jsonnet, easy as a builtin, but the trick is to make sure every language that someone might want to implement Jsonnet in (and therefore have to provide an implementation for each of the builtins) has a native regular expression library with exactly the same regex syntax and semantics.

sparkprime avatar Feb 18 '16 16:02 sparkprime

PCRE seems to be implemented in many languages. Perhaps it would be better to implement this as a native extension (#108) to the language and not part of the core.

mikedanese avatar Feb 18 '16 19:02 mikedanese

Does it support unicode typically?

sparkprime avatar Feb 18 '16 19:02 sparkprime

+1

nand0p avatar May 20 '16 18:05 nand0p

It appears that PCRE does typically support unicode: http://man7.org/linux/man-pages/man3/pcreunicode.3.html

benley avatar May 20 '16 18:05 benley

Would be great if the 3 of you could offer some real use cases for this functionality so I can figure out how to prioritize it.

sparkprime avatar May 20 '16 20:05 sparkprime

For our use case, we need to strip all non-alphanumeric characters from a string variable. I think this can be done currently by splitting the string into an array of characters, checking each character, and then rejoining, but that seems very ugly.

This could perhaps be done more easily with a new function like std.toAlpha(x), but I would think full-on regex capabilities would be a more complete solution.

nand0p avatar May 20 '16 20:05 nand0p

It's not too bad but I can see why you'd rather write it with 0-9A-Za-z type ranges and on one line.

local is_alpha(x) =
    std.setMember(x,"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");
local to_alpha(str) = std.join("", std.filter(is_alpha, std.stringChars(str)));
to_alpha("He.llo World123")

sparkprime avatar May 20 '16 20:05 sparkprime
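
For comparison with the character-filter approach above, here is a rough sketch of the same operation written with 0-9A-Za-z ranges in a regular expression. It uses Go's standard regexp package rather than anything available in Jsonnet, and the helper name stripNonAlnum is made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// nonAlnum matches runs of characters outside 0-9, A-Z and a-z.
var nonAlnum = regexp.MustCompile(`[^0-9A-Za-z]+`)

// stripNonAlnum is a hypothetical helper that deletes every
// non-alphanumeric run from s.
func stripNonAlnum(s string) string {
	return nonAlnum.ReplaceAllString(s, "")
}

func main() {
	fmt.Println(stripNonAlnum("He.llo World123")) // HelloWorld123
}
```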

I have a case where I'd like to be able to replace all instances of - with _. That seems even more cumbersome than the filter-to-alphanumeric use-case discussed above. Regex is overkill for my use-case, but either regex-based or tr-style replace functionality would be helpful.

On a nearer-term note, can the stdlib/built-ins be composed to produce a 'replace each instance of character x with an instance of character y' behavior? I'm not coming up with it, though I'm quite new to Jsonnet.

emmanuel avatar Dec 13 '17 05:12 emmanuel

Here's an example implementation of the proposed 'replace each instance of character x with an instance of character y':

local replaceChars(str, mapping) =
    std.join("", std.map(function(c) if c in mapping then mapping[c] else c, std.stringChars(str)));
replaceChars("abcd", {"b": "!", "d": "?"})

produces: "a!c?"

The implementation above also supports deleting characters or replacing them with multiple characters: replaceChars("abcd", {"b": "xx", "d": ""}) produces: "axxc"

It may be useful generally enough to add it to stdlib. @sparkprime what do you think?

sbarzowski avatar Dec 13 '17 09:12 sbarzowski
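
If tr-like functionality were ever provided as a builtin, the host-language side could be quite small. The following is only a sketch of the replaceChars idea above in Go, assuming the mapping arrives as a map from single-character strings to replacement strings. It is not taken from any Jsonnet implementation, and the function name simply mirrors the Jsonnet snippet.

```go
package main

import (
	"fmt"
	"strings"
)

// replaceChars walks s one code point at a time. A character that is a
// key in mapping is replaced by its mapped string (which may be empty or
// longer than one character); any other character is copied unchanged.
func replaceChars(s string, mapping map[string]string) string {
	var b strings.Builder
	for _, r := range s {
		c := string(r)
		if repl, ok := mapping[c]; ok {
			b.WriteString(repl)
		} else {
			b.WriteString(c)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(replaceChars("abcd", map[string]string{"b": "!", "d": "?"}))  // a!c?
	fmt.Println(replaceChars("abcd", map[string]string{"b": "xx", "d": ""})) // axxc
	fmt.Println(replaceChars("foo-bar-baz", map[string]string{"-": "_"}))     // foo_bar_baz
}
```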

@sbarzowski thank you, that is fantastic. Not only a clean interface to accomplish what I'm looking to get done, but also a good bit of insight about how to approach programming jsonnet. I'd be in favor of adding this to stdlib, but I'm not on the hook for maintenance, so perhaps merely adding to the documentation would be sufficient to help future seekers like myself.

Whatever the decision about adding to stdlib or docs, thank you for the help @sbarzowski.

emmanuel avatar Dec 13 '17 20:12 emmanuel

tr-like functionality is definitely a good candidate for stdlib.

sparkprime avatar Dec 13 '17 20:12 sparkprime

Since this has come up again - do we have compatible implementations of PCRE in Go and C++ that will work with unicode?

sparkprime avatar Dec 13 '17 20:12 sparkprime

Well, there is this thing: https://github.com/glenn-brown/golang-pkg-pcre. This is an interface to libpcre. It seems to hardcode assumptions about where libpcre is installed, though... I couldn't find anything else. Probably using libpcre directly with cgo would be a better option.

sbarzowski avatar Dec 13 '17 21:12 sbarzowski

My guess is that that package defeats part of the purpose of go-jsonnet, which is to allow go programs to use jsonnet without cgo. Could be wrong though. :-)

kamalmarhubi avatar Jan 23 '18 05:01 kamalmarhubi

Yeah I think unless we can find a library that has native Go and C++ support (for exactly the same regex syntax) we'll have to leave regexes as something that people add with native extensions.

sparkprime avatar Jan 26 '18 02:01 sparkprime

Coming back full circle, would RE2 along with Go's built-in regexp package not be a good fit? They're Unicode-aware and syntax-compatible.

From Go's regexp package documentation:

"The syntax of the regular expressions accepted is the same general syntax used by Perl, Python, and other languages. More precisely, it is the syntax accepted by RE2 and described at https://golang.org/s/re2syntax, except for \C."

(The re2syntax link goes to the actual RE2 documentation)

dcoles avatar May 25 '19 19:05 dcoles
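
To make the Unicode point concrete, here is a small sketch using Go's regexp package with a Unicode character class. The pattern and the helper name isLetters are illustrative only; whether the C++ RE2 library is configured to behave identically in a given build is an assumption this sketch does not check.

```go
package main

import (
	"fmt"
	"regexp"
)

// lettersOnly matches a non-empty string made up entirely of Unicode
// letters (general category L), using \A and \z to anchor the whole text.
var lettersOnly = regexp.MustCompile(`\A\p{L}+\z`)

// isLetters is a hypothetical helper reporting whether s consists only
// of Unicode letters.
func isLetters(s string) bool {
	return lettersOnly.MatchString(s)
}

func main() {
	fmt.Println(isLetters("Zürich")) // true
	fmt.Println(isLetters("日本語"))    // true
	fmt.Println(isLetters("abc1"))   // false
}
```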

In that case I guess RE2 is the way forward after all :)

sparkprime avatar May 28 '19 13:05 sparkprime

I'm currently prototyping RE2 regexp support in my https://github.com/google/jsonnet/compare/master...dcoles:re2 branch.

Boolean matches can be implemented pretty trivially, but positional and named captures are going to require a bit more thought. The current plan is to have a match return an object upon successful match or null otherwise. For example:

$ jsonnet -e 'std.regexFullMatch("hello", "h(?P<mid>.*)o")'
{
   "captures": [
      "ell"
   ],
   "namedCaptures": {
      "mid": "ell"
   },
   "string": "hello"
}

This way you can still do things like assert std.regexFullMatch(self.foo, "pattern") != null for validation or use the object fields for accessing captured values.

dcoles avatar May 29 '19 04:05 dcoles
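
As a rough illustration of how the match object described above could be assembled on the Go side, the sketch below builds the same three fields with the standard regexp package. The type and function names are invented for this example and are not taken from the prototype branch. Wrapping the pattern in \A(?:...)\z is one assumed way to get full-match semantics, since Go's regexp has no dedicated full-match call.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchResult mirrors the fields of the proposed match object.
type matchResult struct {
	Captures      []string
	NamedCaptures map[string]string
	String        string
}

// fullMatch returns nil (with a nil error) when s does not match the
// whole pattern, and an error when the pattern does not compile.
func fullMatch(s, pattern string) (*matchResult, error) {
	re, err := regexp.Compile(`\A(?:` + pattern + `)\z`)
	if err != nil {
		return nil, err
	}
	groups := re.FindStringSubmatch(s)
	if groups == nil {
		return nil, nil
	}
	res := &matchResult{
		Captures:      groups[1:], // positional groups, without the whole match
		NamedCaptures: map[string]string{},
		String:        groups[0],
	}
	for i, name := range re.SubexpNames() {
		if i > 0 && name != "" {
			res.NamedCaptures[name] = groups[i]
		}
	}
	return res, nil
}

func main() {
	m, err := fullMatch("hello", `h(?P<mid>.*)o`)
	if err != nil {
		panic(err)
	}
	fmt.Println(m.Captures, m.NamedCaptures, m.String) // [ell] map[mid:ell] hello
}
```

Returning nil for a non-match keeps the `!= null` validation idiom from the comment above straightforward to express on the Jsonnet side.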

I see the PR, which I eagerly anticipate, but just to summarize the points and questions about RE2:

  • RE2 has native implementations in both C++ and Go.
  • RE2 supports Unicode.
  • There are wrappers for most languages. See the bottom of the README.

glenntrewitt avatar Jul 03 '21 19:07 glenntrewitt

It's nice to see this move along; the discussion on the PR is promising.

I have a use case involving JSON schema: I'm building a validator in jsonnet, and it turns out JSON schema has a few features that use regular expressions. I don't know much about the different implementations of regex in the wild; the schema spec depends on the ECMA 262 implementation.

I think it would be safe to provide one native implementation in stdlib, and if users need a different one for their use case they can leverage the native functions feature (or, if they feel adventurous, they can implement one in jsonnet).

Duologic avatar Nov 27 '22 08:11 Duologic

Just had a quick look in other projects as I was curious:

Kubernetes uses regexp to validate the JSON schema pattern attribute (link).

ogen has an interface with a fallback from regexp to dlclark/regexp2 in case the pattern doesn't compile with regexp (link). This was introduced to work around the shortcomings of re2, that is, features which PCRE/ECMA-262 supposedly support (re2 support table and re2 caveats). This fallback library might be interesting for go-jsonnet.

Duologic avatar Nov 27 '22 16:11 Duologic
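
The kind of gap mentioned above can be seen directly in Go: the standard regexp package, which follows RE2 syntax, refuses to compile lookaheads and backreferences, both of which ECMA-262 patterns may use. The sketch below only checks whether a pattern compiles. The helper name compiles and the sample patterns are illustrative, not taken from ogen or Kubernetes.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles reports whether Go's RE2-style regexp package accepts pattern.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	fmt.Println(compiles(`^[a-z]+$`))     // true: plain character class
	fmt.Println(compiles(`^(?=.*\d).+$`)) // false: lookahead is unsupported
	fmt.Println(compiles(`(a)\1`))        // false: backreference is unsupported
}
```

A fallback scheme like the one described would try the second library only when this first compile step fails.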