
Image regex?

Open djay opened this issue 1 year ago • 3 comments

Is your feature request related to a problem? Please describe.

I'm working on contributing a number of changes to improve CMS detection, and one of the most reliable signals I'm finding is image URLs. The downside is that this requires many near-identical DOM selectors to cover all the different places an image URL can appear, such as background styles, meta content, srcsets, and so on. I suspect it would be more efficient, both to run and to write, if the schema were extended with an imageSrc list similar to scriptSrc.

However, this would of course break compatibility, which might be against the project's goals?

Describe the solution you'd like

"imageSrc": [
   "/media"
]

Describe alternatives you've considered

Additional context

djay avatar Nov 06 '24 04:11 djay

In my opinion DOM selectors should be used for this case. I guess you could also include them as url elements. However the point of DOM selectors is to be quicker than regex. If you're going to develop strong regex then you may as well develop strong DOM selectors.

PS I hope /media was a joke example.

kingthorin avatar May 07 '25 15:05 kingthorin

@kingthorin The problem is that you end up with a lot of near-identical DOM selectors. There are about 10 different places in the DOM where an image URL can appear.

You end up with rules like the following when trying to find images in inline CSS, meta tags, etc.

            "img[src*='imager/'],meta[content*='imager/']": {
                "attributes": {
                    "src": "/_?imager/.*/\\d{4,8}/.*_[a-z0-9]{32}\\.",
                    "content": "/_?imager/.*/\\d{4,8}/.*_[a-z0-9]{32}\\."
                }
            },
            "img[src*='_crop_'],img[data-src*='_crop_'],meta[content*='_crop_'],img[src*='_fit_'],img[data-src*='_fit_'],meta[content*='_fit_']": {
                "attributes": {
                    "src": "/[_0-9x]+_(?:crop|fit)_(?:center|top|bottom)-(?:center|left|right)_",
                    "data-src": "/[_0-9x]+_(?:crop|fit)_(?:center|top|bottom)-(?:center|left|right)_",
                    "content": "/[_0-9x]+_(?:crop|fit)_(?:center|top|bottom)-(?:center|left|right)_"
                }
            },
            "img[src*='/_'],meta[content*='/_']": {
                "attributes": {
                    "src": "/_(?:full|medium|quarter)(?:Rectangular|Landscape)/",
                    "content": "/_(?:full|medium|quarter)(?:Rectangular|Landscape)/"
                }
            },
            "img[src*='dm='],meta[content]": {
                "attributes": {
                    "src": "&dm=\\d{10}&s=[0-9a-z]{32}",
                    "content": "&dm=\\d{10}&s=[0-9a-z]{32}"
                }
            }

being replaced by

        "imgSrc": [
            "/_?imager/.*/\\d{4,8}/.*_[a-z0-9]{32}\\.",
            "/_(?:\\d{2,5}|AUTO)x(?:\\d{2,5}|AUTO)_(?:crop|fit)_(?:center|top|bottom)-(?:center|left|right)(?:_|/)",
            "/_(?:full|medium|quarter)(?:Rectangular|Landscape)/",
            "&dm=\\d{10}&s=[0-9a-z]{32}"
        ],

and some code like

        # css() and getprop() are assumed to be helpers defined elsewhere; `re` must be imported.
        # Plain and lazy-loaded <img> sources
        img_urls = [getprop(d, 'src') for d in css('img[src]')]
        img_urls += [getprop(d, 'data-src') for d in css('img[data-src]')]
        # Social/preview images in meta tags, plus the favicon
        img_urls += [getprop(d, 'content') for d in css("meta[property*='image']")]
        img_urls += [getprop(d, 'content') for d in css("meta[name*='image']")]
        img_urls += [getprop(d, 'href') for d in css("link[rel='icon']")]
        img_urls += [getprop(d, 'data-background') for d in css("*[data-background]")]
        # srcset candidates are "URL descriptor" pairs; keep only the URL
        img_urls += [src.split(" ")[0] for d in css("img[data-srcset]") for src in getprop(d, 'data-srcset').split(", ")]
        img_urls += [src.split(" ")[0] for d in css("img[srcset],source[srcset]") for src in getprop(d, 'srcset').split(", ")]
        # url(...) references in inline style attributes and <style> blocks
        img_urls += [src for d in css("*[style]") for src in re.findall("url\\('?([^)']+)'?\\)", getprop(d, 'style'))]
        img_urls += [src for d in css("style") for src in re.findall("url\\('?([^)']+)'?\\)", d.text_content())]
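
For anyone who wants to try the idea without the helpers above, here is a rough, self-contained sketch using only the Python standard library. It is not how webappanalyzer or any consumer implements matching; `ImageUrlCollector`, `matching_image_urls`, and `IMG_SRC_PATTERNS` are hypothetical names, the patterns are copied from the `imgSrc` list above, and srcset candidates are assumed to be separated by a comma followed by whitespace (the same simplification as in the snippet).

```python
import re
from html.parser import HTMLParser

# Patterns copied from the proposed "imgSrc" list (the field itself is hypothetical).
IMG_SRC_PATTERNS = [re.compile(p) for p in (
    r"/_?imager/.*/\d{4,8}/.*_[a-z0-9]{32}\.",
    r"/_(?:\d{2,5}|AUTO)x(?:\d{2,5}|AUTO)_(?:crop|fit)_(?:center|top|bottom)-(?:center|left|right)(?:_|/)",
    r"/_(?:full|medium|quarter)(?:Rectangular|Landscape)/",
    r"&dm=\d{10}&s=[0-9a-z]{32}",
)]

# Matches url(...) in CSS, with optional single or double quotes.
CSS_URL = re.compile(r"""url\(\s*['"]?([^'")]+?)['"]?\s*\)""")


class ImageUrlCollector(HTMLParser):
    """Collects image URLs from the same places as the snippet above."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self._in_style = False

    def handle_starttag(self, tag, attrs):
        a = {k: v for k, v in attrs if v}  # drop valueless attributes
        if tag == "img":
            self.urls += [a[k] for k in ("src", "data-src") if k in a]
        if tag in ("img", "source"):
            for key in ("srcset", "data-srcset"):
                # Each candidate is "URL [descriptor]"; keep only the URL.
                for cand in re.split(r",\s+", a.get(key, "").strip()):
                    if cand.strip():
                        self.urls.append(cand.split()[0])
        if tag == "meta" and "content" in a:
            if "image" in a.get("property", "") or "image" in a.get("name", ""):
                self.urls.append(a["content"])
        if tag == "link" and a.get("rel") == "icon" and "href" in a:
            self.urls.append(a["href"])
        if "data-background" in a:
            self.urls.append(a["data-background"])
        if "style" in a:
            self.urls += CSS_URL.findall(a["style"])
        if tag == "style":
            self._in_style = True

    def handle_endtag(self, tag):
        if tag == "style":
            self._in_style = False

    def handle_data(self, data):
        # Text inside <style> blocks may reference images via url(...).
        if self._in_style:
            self.urls += CSS_URL.findall(data)


def matching_image_urls(html):
    """Return collected image URLs that match any imgSrc pattern."""
    collector = ImageUrlCollector()
    collector.feed(html)
    return [u for u in collector.urls
            if any(p.search(u) for p in IMG_SRC_PATTERNS)]
```

The point of the sketch is that the URL collection happens once per page, so each technology only needs to carry its regexes rather than repeat the selector list.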

djay avatar May 20 '25 03:05 djay

@djay thank you, now your proposal makes much more sense.

kingthorin avatar May 20 '25 10:05 kingthorin