Capture function only returns one match

Open lev-tonkean opened this issue 1 year ago • 7 comments

I'm trying to get all the img src URLs from an HTML body in one of the JSON fields:

{ "body" : "<div class=\"intercom-container\"><img src=\"https://downloads.intercomcdn.com/i/o/243069600/b5cf534d3975fafc7eafa9e7/IMG_2568.PNG?expires=1671464976&amp;signature=6dc0e026cb490829b7e333f8254dd50358356f9bf3369339ddb3aaf11d14ca34\"></div>. <div class=\"intercom-container\"><img src=\"https://downloads.intercomcdn.com/i/o/24306960011/b5cf534d3975fafc7eafa9e7/IMG_2568.PNG?expires=1671464976&amp;signature=6dc0e026cb490829b7e333f8254dd50358356f9bf3369339ddb3aaf11d14ca34\"></div>" }

If I run the following JSLT code:

capture($node.body, "<img src=\"(?<url>https://[^\"]+)\">")

I get only the first img URL, not the second.

There should be a way to return all matches...

lev-tonkean, May 21 '24 22:05

You are assuming that the capture function works in a way that is not documented.

How is JSLT supposed to know that more than 1 URL appears in the text?

It finds at most one occurrence.

Currently, you have to structure your node.body attribute like this:

{
    "body": [
        "<div class='intercom-container'><img src='https://downloads.intercomcdn.com/i/o/243069600/b5cf534d3975fafc7eafa9e7/IMG_2568.PNG?expires=1671464976&amp;signature=6dc0e026cb490829b7e333f8254dd50358356f9bf3369339ddb3aaf11d14ca34'></div>",
        "<div class='intercom-container'><img src='https://downloads.intercomcdn.com/i/o/24306960011/b5cf534d3975fafc7eafa9e7/IMG_2568.PNG?expires=1671464976&amp;signature=6dc0e026cb490829b7e333f8254dd50358356f9bf3369339ddb3aaf11d14ca34'></div>"
    ]
}

The JSLT transformation is then:

[
  for (.body)
     capture (., "<img src='(?<url>https://[^']+)'>")
]

resulting in:

[ {
  "url" : "https://downloads.intercomcdn.com/i/o/243069600/b5cf534d3975fafc7eafa9e7/IMG_2568.PNG?expires=1671464976&amp;signature=6dc0e026cb490829b7e333f8254dd50358356f9bf3369339ddb3aaf11d14ca34"
}, {
  "url" : "https://downloads.intercomcdn.com/i/o/24306960011/b5cf534d3975fafc7eafa9e7/IMG_2568.PNG?expires=1671464976&amp;signature=6dc0e026cb490829b7e333f8254dd50358356f9bf3369339ddb3aaf11d14ca34"
} ]

catull, May 22 '24 10:05

> How is JSLT supposed to know that more than 1 URL appears in the text?

I don't think this is the right way to view the issue. The issue being raised is:

> There should be a way to return all matches...

And clearly there has to be a way to do that. We can't require people to structure the input in a way that fits JSLT. The language has to be designed to handle all JSON inputs.

It is possible to do this now by using capture(), then finding the url match in the string, then slicing the string to remove the matched part, and then using capture() again. A recursive function can do this for any number of matches. It's slow, however, and pretty cumbersome.
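
The recursive approach described above can be sketched roughly as follows. This is an untested sketch and a variation on the description: instead of locating the match and slicing the string, it adds a second named group, rest, that captures everything after the first match, and recurses on that. The function name capture-all-urls and the group name rest are made up for illustration; neither is part of JSLT. The (?s) flag assumes capture() hands the pattern to Java's regex engine unchanged, so that . also matches newlines.

```jslt
// Hypothetical helper, not a JSLT built-in.
// Returns an array with the URL of every <img src="https://..."> in $text.
def capture-all-urls(text)
  // "url" is the first match; "rest" is everything after that match.
  let m = capture($text, "(?s)<img src=\"(?<url>https://[^\"]+)\">(?<rest>.*)")
  if ($m)
    // Keep this URL, then look for further matches in the remainder.
    [$m.url] + capture-all-urls($m.rest)
  else
    // No match left: end the recursion.
    []

capture-all-urls(.body)
```

Applied to the input from the original post, this should give an array of URL strings rather than an array of objects. Each match costs one recursion level and one copy of the remaining text, which is part of why this gets slow on large bodies.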

One way to solve it would be to give capture() a third argument to tell it to return all matches. This would then be an array of dicts instead of just a dict. It's a bit ugly to have different return signatures, so one might add capture-all() as an alternative. I see the -all() variant makes no sense for the other two regexp functions, so we don't risk suddenly having to make 6 regexp functions.

larsga, May 22 '24 10:05

I see your points regarding capture().

Until we have a capture-all(), you have to use what's there.

Neither implementing a recursive function nor restructuring the input is elegant.

I don't know much about the original use case, but if the developer can chunk some source HTML into a JSON object carrying a body attribute, it is safe to assume that the same source HTML can be split into one chunk per div, selected with something like //div[@class='intercom-container'].

catull, May 22 '24 11:05

Found another solution, without having to change the input:

[ for (split (.body, "</div>"))
   capture (., "<img src=\"(?<url>https://[^\"]+)\">")
]

catull, May 22 '24 11:05

> Found another solution, without having to change the input:
>
> [ for (split (.body, "</div>"))
>    capture (., "<img src=\"(?<url>https://[^\"]+)\">")
> ]

This solution worked! Thanks.

lev-tonkean, May 23 '24 19:05

I do think having a capture-all function makes a lot of sense.

lev-tonkean, May 23 '24 19:05

Try this one:

[ for (split (.body, "<img ")[1:])
  capture (., "^src=\"(?<url>[^\"']+)\"")
]

It matches URLs of any scheme, not just https, as long as src is the first attribute of the img tag.
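
One caveat with split-based variants like this one: a chunk whose img tag does not begin with src= does not match, and as far as I can tell from the docs capture() then returns an empty object, which ends up in the output array. A possible refinement, assuming the if filter on array comprehensions is available in the JSLT version in use, keeps only the matching chunks and returns the URLs as plain strings. This is an untested sketch, and the variable name pattern is just for illustration.

```jslt
// Same regex as above, stored once so the filter and the capture agree.
let pattern = "^src=\"(?<url>[^\"']+)\""

[ for (split(.body, "<img ")[1:])
    // Pull the named group out of the capture result.
    capture(., $pattern).url
  // Skip chunks where the img tag does not start with src="...".
  if (test(., $pattern))
]
```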

catull, May 23 '24 20:05