string replace should have options for both literal string replacement and regex replacement

Open samer1977 opened this issue 1 year ago • 10 comments

No doubt having regex replace is very powerful, but sometimes you just want to do a simple literal string replace. Where I ran into a problem was when I wanted to replace a literal string that contains regex characters. I could not find an easy way to do that other than escaping every regex reserved character first, which gets cumbersome and inefficient.
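
To make the request concrete, here is a minimal Java sketch of the two behaviours (Java because JSLT runs on the JVM and uses Java regular expressions). The class and method names are hypothetical and are not part of JSLT; the sketch only assumes that a literal replace would behave like quoting both the search string and the replacement.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a JSLT API: contrasts regex and literal replacement.
class LiteralReplaceSketch {
    // Regex replacement: 'target' is compiled as a pattern, so '.', '?', '(' etc. are special.
    static String replaceRegex(String input, String target, String replacement) {
        return input.replaceAll(target, replacement);
    }

    // Literal replacement: Pattern.quote disables every metacharacter in 'target',
    // and Matcher.quoteReplacement does the same for '$' and '\' in 'replacement'.
    static String replaceLiteral(String input, String target, String replacement) {
        return Pattern.compile(Pattern.quote(target))
                      .matcher(input)
                      .replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(replaceRegex("a.b", ".", "_"));   // "___": '.' matches any character
        System.out.println(replaceLiteral("a.b", ".", "_")); // "a_b": only the literal dot
    }
}
```

In plain Java, `String.replace(CharSequence, CharSequence)` already gives the literal behaviour; the quoting version is shown because it maps directly onto an existing regex-based replace.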

samer1977 avatar May 22 '24 13:05 samer1977

@samer1977 Can you give an example ?

catull avatar May 22 '24 15:05 catull

OK, I was trying to solve the problem here: https://github.com/schibsted/jslt/issues/342 using a recursive function call like this:

def capture-many(json, regex, key)
   let c = capture($json, $regex)
   let res = if ($c == {}) []
             else [$c] + capture-many(replace($json, get-key($c, $key), ""), $regex, $key)
   $res

capture-many(.body,"<img src=\"(?<url>[a-z])\">","url")

To do that I have to get each capture, store it in an array, then do the next capture recursively after purging the captured value from the JSON string by replacing it with an empty string, and so on until no capture is left. I understand the limitations: each capture has to be different, and you can only capture one key-value pair, which would have been fine for this scenario.

The above function would have worked in a simpler scenario like this:

{ "body" : "<div class="intercom-container"><img src="image1">. <div class="intercom-container"><img src="image2">" }

However, once you introduce a more complex string such as a URL, it won't work, because the captured value is then interpreted as a regex by replace.

samer1977 avatar May 23 '24 08:05 samer1977

Sorry to be a nag, can you give us a challenging example ? Are you talking about a URL that contains any one of these characters:

  • '*'
  • '(', ')', '[', ']', '{' or '}'
  • '&', '' ?

catull avatar May 23 '24 08:05 catull

Yes. Can you make the recursive function above work on the original input without having to escape every regex char ?

{ "body" : "<div class="intercom-container"><img src="https://downloads.intercomcdn.com/i/o/243069600/b5cf534d3975fafc7eafa9e7/IMG_2568.PNG?expires=1671464976&signature=6dc0e026cb490829b7e333f8254dd50358356f9bf3369339ddb3aaf11d14ca34">. <div class="intercom-container"><img src="https://downloads.intercomcdn.com/i/o/24306960011/b5cf534d3975fafc7eafa9e7/IMG_2568.PNG?expires=1671464976&signature=6dc0e026cb490829b7e333f8254dd50358356f9bf3369339ddb3aaf11d14ca34">" }

samer1977 avatar May 23 '24 08:05 samer1977
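
For illustration, this is roughly what the recursive approach looks like once the replace step is literal. It is a hedged Java sketch, not JSLT code: `captureMany` is a hypothetical port of `capture-many`, the example.com URLs in the comments and tests are placeholders, and the pattern uses `[^"]+` for the capture group so that a whole URL can match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical Java port of capture-many; names are illustrative only.
class CaptureManySketch {
    // Finds the first match, records the named group, removes that exact text
    // literally, and recurses on what is left.
    static List<String> captureMany(String text, String regex, String group) {
        Matcher m = Pattern.compile(regex).matcher(text);
        List<String> result = new ArrayList<>();
        if (!m.find()) {
            return result; // no capture left: end of recursion
        }
        String value = m.group(group);
        result.add(value);
        // String.replace is literal, so '?', '.', '&' in a URL need no escaping.
        // The group must never match an empty string, or the text would not shrink.
        result.addAll(captureMany(text.replace(value, ""), regex, group));
        return result;
    }

    public static void main(String[] args) {
        String body = "<div><img src=\"https://example.com/a.png?x=1&y=2\">. "
                    + "<div><img src=\"https://example.com/b.png?x=1&y=2\">";
        System.out.println(captureMany(body, "<img src=\"(?<url>[^\"]+)\">", "url"));
    }
}
```

The limitation mentioned earlier still applies: because every occurrence of the captured value is removed, identical URLs are reported only once.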

How about this.

input

{ "body": "<div class=\"intercom-container\"><img src=\"image1\"></img></div><div class=\"intercom-container\"><img src=\"image2\"></img></div><div class=\"intercom-container\"><img src=\"image3\" /><div class=\"intercom-container\"><img src=\"image4\"/><img src=\"^&*image5\"/></div>"
}

You have different image tags:

<img src="image1"></img>
<img src="image2"></img>
<img src="image3" />
<img src="image4"/>
<img src="^&*image5"/>

Here's the transformation:

[ for (split (.body, "<img ")[1:])
  capture (., "^src=\"(?<url>[^\"']+)\"")
]

No need to use recursion.

The trick is to split up the input with the "separator" <img .

Yes, there can be any funny characters in the src-attribute, even regexp "reserved" characters. We are capturing only what is in between src=" and the ending double quote.

The result is:

[ {
    "url" : "image1"
  }, {
    "url" : "image2"
  }, {
    "url" : "image3"
  }, {
    "url" : "image4"
  }, {
    "url" : "^&*image5"
  }
]

catull avatar May 23 '24 09:05 catull

How did I come to this solution ?

I first only used this

split (.body, "<img ")

This gave me

[
    "<div class=\"intercom-container\">",
    "src=\"image1\"></img></div><div class=\"intercom-container\">",
    "src=\"image2\"></img></div><div class=\"intercom-container\">",
    "src=\"image3\" /><div class=\"intercom-container\">",
    "src=\"image4\"/>",
    "src=\"^&*image5\"/></div>"
]

As you can see, the first element in the array does not have a "src=" at the beginning. So it has to be excluded, thus changing the transformation to

split (.body, "<img ")[1:]

Now that all elements start with "src=", the regexp just becomes - basically anything between the quotes:

  capture (., "^src=\"(?<url>[^\"']+)\"")

And now you wrap array processing around it resulting in "[ for ..... capture (...) ]".
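
The same three steps (split on the separator, drop the first fragment, capture from each remaining fragment) can be sketched in Java, whose regex syntax JSLT shares. This is only an illustration under that assumption; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical Java rendering of the split-then-capture transformation.
class SplitCaptureSketch {
    // Same pattern as in the transformation: anchored at the start of each fragment.
    private static final Pattern SRC = Pattern.compile("^src=\"(?<url>[^\"']+)\"");

    static List<String> imageUrls(String body) {
        // String.split takes a regex; Pattern.quote keeps the separator literal.
        String[] parts = body.split(Pattern.quote("<img "));
        List<String> urls = new ArrayList<>();
        // Start at index 1: the first fragment is the text before the first <img tag.
        for (int i = 1; i < parts.length; i++) {
            Matcher m = SRC.matcher(parts[i]);
            if (m.find()) {
                urls.add(m.group("url"));
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String body = "<div><img src=\"image1\"></img></div><img src=\"^&*image5\"/>";
        System.out.println(imageUrls(body)); // [image1, ^&*image5]
    }
}
```

Because the URL is only ever captured and never reused as a pattern, regex metacharacters inside it cause no trouble.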

catull avatar May 23 '24 09:05 catull

First of all, your input is not properly formatted; you should get plenty of errors in the sandbox alone. You are not properly escaping the double quotes in the body attribute.

Here's how it should be:

{ "body": "<div class=\"intercom-container\"><img src=\"image1\">. <div class=\"intercom-container\"><img src=\"image2\">"
}

Second, you are not properly regexing. All you need to do is express that you want to capture all non-double-quote characters after <img src=" up to the next double quote.

image

You want to exclude anything that is blue, only capture the orange string. The part in purple - (?<url> and ) after the '+' - is only there to tell regexp that you have a capturing group named url.

So, this below should work.

def capture-many (json, regex, key)
   let c = capture ($json, $regex)
   let res = if ($c == {}) [] 
             else [$c] + capture-many (replace($json, get-key($c,$key), ""),$regex,$key)

   $res

capture-many (.body, "<img src=\"(?<url>[^\"]+)\">", "url")

A few words of advice.

  1. Use the playground -> https://www.garshol.priv.no/jslt-demo
  2. Use a proper JSON editor, it would have pointed out the quotation problems to you.
  3. Learn about regex, think in positive / negative - what is in, what is out.
  4. Specifically, regexp character classes.

Instead of [a-z], which only captures lower-case letters of the alphabet, you have to look at it differently.

What is in: anything orange above, that means, a sequence of 1 or more characters EXCEPT for a double quote. What is out: <img src=" at the beginning, and "...... at the end.

The "in" part is expressed as [^"]+, this is called a negated character class.
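
A small Java sketch of the difference, using the same regex syntax. `findAll` is a hypothetical helper that simply collects the `url` group of every match; it is not a JSLT function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper comparing [a-z] with the negated character class [^"]+.
class NegatedClassSketch {
    // Collects the named group "url" from every match of 'regex' in 'text'.
    static List<String> findAll(String text, String regex) {
        Matcher m = Pattern.compile(regex).matcher(text);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group("url"));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "<img src=\"image1\"><img src=\"a\">";
        // [a-z] allows exactly one lower-case letter between the quotes.
        System.out.println(findAll(text, "<img src=\"(?<url>[a-z])\">"));   // [a]
        // [^"]+ allows one or more characters of any kind except a double quote.
        System.out.println(findAll(text, "<img src=\"(?<url>[^\"]+)\">")); // [image1, a]
    }
}
```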

A good resource is https://www.regular-expressions.info/charclass.html

Good luck.

catull avatar May 23 '24 15:05 catull

OK! Thanks for your detailed answer. I appreciate it; at least it's detailed. I will take everything you said into consideration and try to be more careful when posting data or code. I did not pay much attention to what I was pasting because I had made it clear early on that this is all based on https://github.com/schibsted/jslt/issues/342, and I assumed that would be your source. No excuse though, I will try to do better next time. I do use all of the above and I understand regex very well, but sorry, I'm still human.

samer1977 avatar May 23 '24 16:05 samer1977

We are all here to learn from each other.

catull avatar May 23 '24 17:05 catull

I created a PR, see #350.

catull avatar May 27 '24 06:05 catull