Image regex?
Is your feature request related to a problem? Please describe.
I'm contributing a number of changes to improve CMS detection, and one of the best signals I'm finding is image URLs. In practice this means writing many near-identical DOM selectors to cover all the places an image URL can appear: background styles, meta content, srcsets, and so on. I suspect it would be more efficient, both to run and to write, to extend the schema with an imageSrc list similar to scriptSrc.
However, this would of course break compatibility, which might be against the project's goals?
Describe the solution you'd like
"imageSrc": [
  "/media"
]
In my opinion DOM selectors should be used for this case. I guess you could also include them as url elements. However, the point of DOM selectors is to be quicker than regex. If you're going to develop strong regexes, you may as well develop strong DOM selectors.
PS I hope /media was a joke example.
@kingthorin The problem is that you end up with a lot of the same DOM selectors. There are about 10 different places in the DOM where an image URL can appear.
Trying to find images in inline CSS, meta tags, and so on, you end up with rules like the following:
"img[src*='imager/'],meta[content*='imager/']": {
  "attributes": {
    "src": "/_?imager/.*/\\d{4,8}/.*_[a-z0-9]{32}\\.",
    "content": "/_?imager/.*/\\d{4,8}/.*_[a-z0-9]{32}\\."
  }
},
"img[src*='_crop_'],img[data-src*='_crop_'],meta[content*='_crop_'],img[src*='_fit_'],img[data-src*='_fit_'],meta[content*='_fit_']": {
  "attributes": {
    "src": "/[_0-9x]+_(?:crop|fit)_(?:center|top|bottom)-(?:center|left|right)_",
    "data-src": "/[_0-9x]+_(?:crop|fit)_(?:center|top|bottom)-(?:center|left|right)_",
    "content": "/[_0-9x]+_(?:crop|fit)_(?:center|top|bottom)-(?:center|left|right)_"
  }
},
"img[src*='/_'],meta[content*='/_']": {
  "attributes": {
    "src": "/_(?:full|medium|quarter)(?:Rectangular|Landscape)/",
    "content": "/_(?:full|medium|quarter)(?:Rectangular|Landscape)/"
  }
},
"img[src*='dm='],meta[content]": {
  "attributes": {
    "src": "&dm=\\d{10}&s=[0-9a-z]{32}",
    "content": "&dm=\\d{10}&s=[0-9a-z]{32}"
  }
}
These would be replaced by:
"imgSrc": [
  "/_?imager/.*/\\d{4,8}/.*_[a-z0-9]{32}\\.",
  "/_(?:\\d{2,5}|AUTO)x(?:\\d{2,5}|AUTO)_(?:crop|fit)_(?:center|top|bottom)-(?:center|left|right)(?:_|/)",
  "/_(?:full|medium|quarter)(?:Rectangular|Landscape)/",
  "&dm=\\d{10}&s=[0-9a-z]{32}"
],
plus some code along these lines:
import re

# css() and getprop() stand for whatever selector/attribute helpers the detector already has.
# Matches url(...) in CSS with single, double, or no quotes around the URL.
CSS_URL = re.compile(r"url\(['\"]?([^)'\"]+)['\"]?\)")

img_urls = [getprop(d, 'src') for d in css('img[src]')]
img_urls += [getprop(d, 'data-src') for d in css('img[data-src]')]
img_urls += [getprop(d, 'content') for d in css("meta[property*='image']")]
img_urls += [getprop(d, 'content') for d in css("meta[name*='image']")]
img_urls += [getprop(d, 'href') for d in css("link[rel='icon']")]
img_urls += [getprop(d, 'data-background') for d in css("*[data-background]")]
# srcset is a comma-separated list of "URL descriptor" pairs; keep only the URL part.
img_urls += [src.split()[0] for d in css("img[data-srcset]") for src in getprop(d, 'data-srcset').split(",") if src.strip()]
img_urls += [src.split()[0] for d in css("img[srcset],source[srcset]") for src in getprop(d, 'srcset').split(",") if src.strip()]
img_urls += [src for d in css("*[style]") for src in CSS_URL.findall(getprop(d, 'style'))]
img_urls += [src for d in css("style") for src in CSS_URL.findall(d.text_content())]
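To make the idea easier to try out, here is a minimal, self-contained sketch of the same approach using only the Python standard library. It is not the project's implementation: `ImageUrlCollector`, `detect`, and the sample HTML are hypothetical names and data made up for illustration, the collector covers only a subset of the locations listed above, and the srcset handling assumes URLs contain no commas. The four patterns are copied from the proposed imgSrc list.

```python
import re
from html.parser import HTMLParser

# Patterns copied from the proposed "imgSrc" list (JSON escaping removed).
IMG_SRC_PATTERNS = [re.compile(p) for p in (
    r"/_?imager/.*/\d{4,8}/.*_[a-z0-9]{32}\.",
    r"/_(?:\d{2,5}|AUTO)x(?:\d{2,5}|AUTO)_(?:crop|fit)_(?:center|top|bottom)-(?:center|left|right)(?:_|/)",
    r"/_(?:full|medium|quarter)(?:Rectangular|Landscape)/",
    r"&dm=\d{10}&s=[0-9a-z]{32}",
)]

# url(...) references in CSS, with single, double, or no quotes.
CSS_URL = re.compile(r"url\(['\"]?([^)'\"]+)['\"]?\)")


class ImageUrlCollector(HTMLParser):
    """Hypothetical helper: gathers candidate image URLs in one pass."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self._in_style = False

    def handle_starttag(self, tag, attrs):
        a = {k: v for k, v in attrs if v is not None}
        if tag == "img":
            for name in ("src", "data-src"):
                if name in a:
                    self.urls.append(a[name])
        if tag in ("img", "source"):
            for name in ("srcset", "data-srcset"):
                # Each candidate is "URL [descriptor]"; keep the URL part only.
                for cand in a.get(name, "").split(","):
                    if cand.strip():
                        self.urls.append(cand.split()[0])
        if tag == "meta" and "content" in a:
            if any("image" in a.get(k, "") for k in ("property", "name")):
                self.urls.append(a["content"])
        if "style" in a:
            self.urls += CSS_URL.findall(a["style"])
        if tag == "style":
            self._in_style = True

    def handle_endtag(self, tag):
        if tag == "style":
            self._in_style = False

    def handle_data(self, data):
        # Only text inside <style> blocks is scanned for url(...) references.
        if self._in_style:
            self.urls += CSS_URL.findall(data)


def detect(html):
    """Return the collected image URLs that match any imgSrc pattern."""
    collector = ImageUrlCollector()
    collector.feed(html)
    collector.close()
    return [u for u in collector.urls
            if any(p.search(u) for p in IMG_SRC_PATTERNS)]


# Made-up page fragment, for illustration only.
sample = """
<meta property="og:image" content="/assets/_fullLandscape/hero.jpg">
<img src="/img/logo.png" srcset="/media/_800x600_crop_center-center_/a.jpg 1x, /img/b.jpg 2x">
<div style="background: url('/photo.jpg?w=10&amp;dm=1600000000&amp;s=0123456789abcdef0123456789abcdef')"></div>
"""

for url in detect(sample):
    print(url)
```

The point of the sketch is that the URL collection happens once per page, after which each technology only contributes regexes, rather than repeating the same selector list in every rule.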
@djay thank you, now your proposal makes much more sense.