[Feature] Common field post-processing scraper enhancement
Is your feature request related to a problem? Please describe.
Suppose you're building a scraper that parses many websites belonging to the same studio. All the sites follow the same structure but use different base URLs (such as www.google.com or www.duckduckgo.com). Sometimes, when scraping for the image on these sites, it's impossible to get anything but a relative URL (such as /image/thumb.gif).
If our scraper only ever dealt with one website, we could hard-code the base URL for the image in post-processing. When dealing with multiple sites, though, the scraper has no way to know which site it is scraping, so we cannot hard-code the base URL.
Describe the solution you'd like
Either:
- Create a constant, such as $domain, that contains the base URL of the page being scraped. So if the scene provided is http://www.domain.com/videos/scene.html, then $domain would contain http://www.domain.com, or
- A more flexible solution (assuming it's feasible and doesn't break anything): allow post-processing for common fields so users can make their own custom variables containing a base URL. Here's an example of a simple one that could work (taken from Zenvo's example on Discord for expedience):
Common:
  $site:
    selector: //base/@href
    postProcess:
      - replace:
          - regex: /tour/
            with:
      - replace:
          - regex: http://
            with: https://
And then
Image:
  selector: //script[contains(.,'jwplayer("jwbox").setup')]/text()
  replace:
    - regex: (.+image:\s+")(.+jpg)(.+)
      with: $2
    - regex: ^
      with: $site
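To make the two options concrete, here is a rough Python sketch of what each would compute. It is only an illustration, not how Stash implements anything: the function names `domain_of`, `site_from_base_href`, and `image_url` are hypothetical, and the sample inputs (`www.example.com`, the jwplayer snippet) are made up.

```python
import re
from urllib.parse import urlsplit


def domain_of(scene_url: str) -> str:
    """Option 1: what a hypothetical $domain constant would hold."""
    parts = urlsplit(scene_url)
    return f"{parts.scheme}://{parts.netloc}"


def site_from_base_href(base_href: str) -> str:
    """Option 2: mimic the two replace steps from the Common example."""
    site = re.sub(r"/tour/", "", base_href)       # drop the /tour/ suffix
    return re.sub(r"http://", "https://", site)   # force https


def image_url(script_text: str, site: str) -> str:
    """Mimic the Image field: keep capture group 2, then prefix the site.

    The scraper YAML writes the group as $2; Python's re module uses \\2.
    """
    path = re.sub(r'(.+image:\s+")(.+jpg)(.+)', r"\2", script_text)
    return site + path  # same effect as replacing ^ with $site


if __name__ == "__main__":
    # Made-up inputs for illustration only.
    script = 'jwplayer("jwbox").setup({image: "/images/thumb.jpg", width: 640});'
    site = site_from_base_href("http://www.example.com/tour/")
    print(domain_of("http://www.domain.com/videos/scene.html"))
    print(image_url(script, site))
```

Either way the scraper ends up with a site prefix it did not have to hard-code; the difference is whether it comes from the URL being scraped (option 1) or from something on the page (option 2).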
@bnkai posted a way to accomplish my specific use-case using unions: https://github.com/stashapp/CommunityScrapers/pull/135
So I think this probably makes my feature request moot, at least for the use-case described here.
I had a similar thought to solution no. 1 on Discord a couple of weeks back. https://discord.com/channels/559159668438728723/651418567475986432/742455016031518811
peolic 10/08/2020
okay so, this is gonna be a little long, because I'm a bit lost and I'm just not sure what's a good way to go about this.
Anyone else feel like joining together a bunch of domains with similar scraping layouts is making per-site fields more limited?
It's kind of hard for me to explain... I can use fixed: value on the Studio.Name field when it's a single-site scraper, like I just did with a HotCrazyMess scraper (there's a PR), but now I found out it has pretty much the same layout as the Nubiles scraper. EXCEPT the Nubiles Studio.Name selector is not available on hotcrazymess.com, and anyway I wanted the studio name to be Hot Crazy Mess (not hotcrazymess), but that's not available.
I thought maybe opening a feature request for an $origin variable that will be equal to the currently scraped URL's origin part (https://hotcrazymess.com in this case), so I could maybe do this in Nubiles:
Studio:
  Name:
    fixed: $origin
    postProcess:
      - map:
          'https://hotcrazymess.com': Hot Crazy Mess
That seems like a dirty hack, which is why I'm not sure about posting this feature request.
But it could still be used for when the URLs in the content don't have the site's hostname and you need it for a full URL for Stash.
Performers:
  Name: $models/a/text()
  URL:
    selector: $models/a/@href
    postProcess:
      - replace:
          - regex: ^
            with: $origin/
Anyone got better ideas?
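For anyone trying to picture the $origin idea from the quoted message, this is roughly what the two snippets would compute, sketched in Python. Nothing here is Stash code: `origin_of`, `studio_name`, `performer_url`, and the `STUDIO_NAMES` table are hypothetical names, and the fallback to the raw origin when no mapping matches is an assumption of this sketch.

```python
from urllib.parse import urlsplit

# Hypothetical lookup table playing the role of the map postProcess.
STUDIO_NAMES = {"https://hotcrazymess.com": "Hot Crazy Mess"}


def origin_of(scraped_url: str) -> str:
    """Scheme and host of the scraped URL (the proposed $origin)."""
    parts = urlsplit(scraped_url)
    return f"{parts.scheme}://{parts.netloc}"


def studio_name(scraped_url: str) -> str:
    """fixed: $origin followed by map; unmapped origins pass through (assumed)."""
    origin = origin_of(scraped_url)
    return STUDIO_NAMES.get(origin, origin)


def performer_url(scraped_url: str, href: str) -> str:
    """Replace ^ with "$origin/": prefix a host-less href with the origin."""
    return origin_of(scraped_url) + "/" + href


if __name__ == "__main__":
    page = "https://hotcrazymess.com/video/watch/123"  # made-up scene URL
    print(studio_name(page))
    print(performer_url(page, "model/profile/jane"))
```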
Constructing full URLs from relative links and other scraped info is pretty fragile and tedious. I have a scraper where it's impossible to construct the full URL for a subscraper without knowing the current full URL. Just the domain isn't enough, so I can't even do it with a hardcoded prefix. I'd love to have a specific postProcess just for expanding URLs to the full path.
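As a sketch of what such a URL-expanding postProcess might do, standard relative-reference resolution against the full URL of the current page handles both root-relative and path-relative links, which a plain domain prefix cannot. The example below uses Python's `urllib.parse.urljoin`; the `expand_url` name and the example.com URLs are hypothetical, and this says nothing about how Stash would actually implement it.

```python
from urllib.parse import urljoin


def expand_url(page_url: str, href: str) -> str:
    """Resolve a scraped href against the full URL of the page it came from."""
    return urljoin(page_url, href)


if __name__ == "__main__":
    page = "https://www.example.com/videos/scene.html"  # made-up page URL
    for href in ("/image/thumb.gif", "trailer.html", "../models/jane.html"):
        print(href, "->", expand_url(page, href))
```

Only the first href could be fixed with a domain prefix; the other two depend on the path of the current page, which is the case described above.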
Can this be resolved by https://github.com/stashapp/stash/pull/6250? Or would common fragments still be helpful without yml unions?