[Feature] Common field post-processing scraper enhancement

Open echo6ix opened this issue 5 years ago • 4 comments

Is your feature request related to a problem? Please describe.

Suppose you're building a scraper that parses many websites belonging to the same studio. All the sites follow the same structure but use different base URLs (such as www.google.com or www.duckduckgo.com). Sometimes, when scraping these sites for the image, it's impossible to get anything but a relative URL (such as /image/thumb.gif).

If our scraper only ever dealt with one website, we could hard-code the base URL for the image in post-processing. With multiple sites, the scraper has no way to know which site it is dealing with, so the base URL cannot be hard-coded.

Describe the solution you'd like

Either:

  1. Create a constant, such as $domain, that contains the base URL derived from the URL being scraped. So if the scene URL provided is http://www.domain.com/videos/scene.html, then $domain would contain http://www.domain.com, or

  2. A more flexible solution (assuming it's feasible and doesn't break anything): allow post-processing for common fields, so users can define their own custom variables containing a base URL. Here's a simple example that could work (taken from Zenvo's example on Discord for expedience):

Common:
  $site:
    selector: //base/@href
    postProcess:
      - replace:
          - regex: /tour/
            with: ""
      - replace:
          - regex: http://
            with: https://

And then

Image:
  selector: //script[contains(.,'jwplayer("jwbox").setup')]/text()
  replace:
    - regex: (.+image:\s+")(.+jpg)(.+)
      with: $2
    - regex: ^
      with: $site

echo6ix avatar Aug 24 '20 05:08 echo6ix

@bnkai posted a way to accomplish my specific use-case using unions: https://github.com/stashapp/CommunityScrapers/pull/135

So I think this probably makes my feature request moot, at least for the use-case described here.

echo6ix avatar Aug 24 '20 06:08 echo6ix

I had a thought similar to solution no. 1 on Discord a couple of weeks back. https://discord.com/channels/559159668438728723/651418567475986432/742455016031518811

Message

peolic 10/08/2020

okay so, this is gonna be a little long, because I'm a bit lost and I'm just not sure what's a good way to go about this.

Anyone else feel like joining together a bunch of domains with similar scraping layouts is making per-site fields more limited?

It's kind of hard for me to explain... I can use fixed: value on Studio.Name field when it's a single-site scraper, like I just did with a HotCrazyMess scraper (there's a PR), but now I found out it has pretty much the same layout as the Nubiles scraper. EXCEPT the Nubiles Studio.Name selector is not available on hotcrazymess.com, and anyway I wanted the studio name to be Hot Crazy Mess (not hotcrazymess), but that's not available.

I thought maybe opening a feature request for an $origin variable that will be equal to the currently scraped URL's origin part (https://hotcrazymess.com in this case), so I could maybe do this in Nubiles:

Studio:
  Name:
    fixed: $origin
    postProcess:
      - map:
          'https://hotcrazymess.com': Hot Crazy Mess

That seems like a dirty hack which is why I'm not sure about posting this feature request.

But it could still be used when the URLs in the content don't include the site's hostname and you need it to build a full URL for Stash.

Performers:
  Name: $models/a/text()
  URL:
    selector: $models/a/@href
    postProcess:
      - replace:
        - regex: ^
          with: $origin/

Anyone got better ideas?
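
For illustration, the value that the proposed $origin (or $domain) variable would hold can be computed from the scraped URL with a standard URL parser. This is a minimal Python sketch of the idea, not how Stash implements anything; the helper names `origin_of` and `studio_name` and the table `STUDIO_NAMES` are hypothetical, and falling back to the raw origin for unmapped sites is a choice made only for this sketch.

```python
from urllib.parse import urlsplit

# Hypothetical lookup table mirroring the `map` post-process example above.
STUDIO_NAMES = {"https://hotcrazymess.com": "Hot Crazy Mess"}


def origin_of(url: str) -> str:
    """Return scheme://host[:port] of url, i.e. what $origin would contain."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def studio_name(url: str) -> str:
    """Map the origin to a display name; unmapped origins are returned as-is
    (an assumption of this sketch)."""
    origin = origin_of(url)
    return STUDIO_NAMES.get(origin, origin)
```

With these helpers, `origin_of("http://www.domain.com/videos/scene.html")` yields `http://www.domain.com`, which matches the $domain example from the issue description.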

peolic avatar Aug 24 '20 07:08 peolic

Constructing full URLs from relative links and other scraped info is pretty fragile and tedious. I have a scraper where it's impossible to construct the full URL for a subscraper without knowing the current full URL. Just the domain isn't enough, so I can't even do it with a hardcoded prefix. I'd love to have a specific postProcess just for expanding URLs to the full path.
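
As a rough illustration of what such a URL-expanding post-process would need to do, Python's standard `urllib.parse.urljoin` resolves a scraped link against the full URL of the current page. This is only a sketch of the idea; the helper name `expand_url` and the example URLs are made up, and nothing here is an existing Stash post-process.

```python
from urllib.parse import urljoin


def expand_url(page_url: str, scraped_href: str) -> str:
    """Resolve a scraped (possibly relative) link against the page it came from."""
    return urljoin(page_url, scraped_href)


page = "https://www.domain.com/videos/scene.html"  # example URL only

# Root-relative link: only the origin of the page is needed.
print(expand_url(page, "/image/thumb.gif"))
# Path-relative link: the full current URL is needed, not just the domain.
print(expand_url(page, "thumb.gif"))
# Already absolute link: returned unchanged.
print(expand_url(page, "https://cdn.example.com/a.jpg"))
```

The path-relative case is the one a hardcoded domain prefix cannot handle, because the result depends on the path of the page being scraped.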

timo95 avatar Mar 11 '25 16:03 timo95

Can this be resolved by https://github.com/stashapp/stash/pull/6250? Or would common fragments still be helpful without yml unions?

feederbox826 avatar Dec 03 '25 22:12 feederbox826