
Scraper Target CSS Pseudo-Elements

Open DaleIndigo opened this issue 3 years ago • 1 comments

Sorry if this isn't the appropriate place to ask this, but I've been searching off-and-on for about a week and can't find an answer.

I'm trying to create a scraper for Strokies. I'm a web developer but not experienced with Python in general. I got it mostly working. The only problem is that most of the DIVs don't have classes or IDs, so I was wondering if I could target them with something like "last-child" or "nth-of-type", etc.

Here's what I have so far; only the Title works. Hopefully someone knows whether this is possible and, if so, the proper syntax for it.

name: "Strokies"
sceneByURL:
  - action: scrapeXPath
    url:
      - strokies.com/video/
    scraper: strokiesScraper
xPathScrapers:
  strokiesScraper:
    scene:
      Title:
        selector: //h1/text()
      Date:
        selector: //div[@class="video-info"]/div/p:nth-of-type(3)/text()
        postProcess:
          - parseDate: Jan 2, 2006
      Details:
        selector: //div[contains(@class, "video-text")]/div:nth-of-type(4)/p
        concat: "\n\n"
      Performers:
        Name: //div[contains(@class, "video-text")]/div:nth-of-type(2)/a
      Tags:
        Name: //div[contains(@class, "video-text")]/div:nth-of-type(3)/a
      Image:
        selector: //img[@class="vjs-tech"]
        postProcess:
          - replace:
              - regex: .+(?:poster=)([^"]*)
                with: $1
      Studio:
        Name:
          fixed: Strokies

Any help with that syntax would be greatly appreciated.

TIA!

DaleIndigo avatar Mar 15 '22 17:03 DaleIndigo

Stash uses XPath, not CSS selectors. You can have a look at https://github.com/stashapp/stash/blob/develop/ui/v2.5/src/docs/en/ScraperDevelopment.md#xpath-and-json-scrapers-configuration and https://devhints.io/xpath for more details. For example, div:nth-of-type(4) in CSS is div[4] in XPath. In practice we try to avoid index-based selectors where possible; in your case, something like the below:

name: "Strokies"
sceneByURL:
  - action: scrapeXPath
    url:
      - strokies.com/video/
    scraper: strokiesScraper
xPathScrapers:
  strokiesScraper:
    scene:
      Title:
        selector: //h1/text()
      Date:
        selector: //div[@class="video-info"]//p[starts-with(text(),"Added on:")]
        postProcess:
          - replace:
              - regex: '^Added on:\s*'
                with: ""
          - parseDate: Jan 2, 2006
      Details:
        selector: '//div[contains(@class, "video-text")]/div[@style="color: white;"]/p'
        concat: "\n\n"
      Performers:
        Name: //div[@class="model-tags"]/span[starts-with(text(),"Model:")]/following-sibling::a
      Tags:
        Name: //div[@class="model-tags"]/span[starts-with(text(),"Tags:")]/a
      Image:
        selector: //video/@poster
        postProcess:
          - replace:
              - regex: ^//
                with: https://
      Studio:
        Name:
          fixed: Strokies
# Last Updated March 19, 2022
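
For comparison, here is a rough sketch of what a direct index-based translation of the selectors from the question might look like, using the div:nth-of-type(4) to div[4] mapping mentioned above. It is untested against the live site and assumes the page layout is what the original CSS-style selectors imply. The replace step on Date is carried over from the scraper above, on the assumption that the date text starts with "Added on:".

```yaml
# Index-based XPath equivalents of the CSS-style selectors in the question.
# Assumption: the page layout matches what those selectors imply (untested).
xPathScrapers:
  strokiesScraper:
    scene:
      Date:
        # p:nth-of-type(3) -> p[3]
        selector: //div[@class="video-info"]/div/p[3]/text()
        postProcess:
          # Assumed prefix, taken from the scraper above
          - replace:
              - regex: '^Added on:\s*'
                with: ""
          - parseDate: Jan 2, 2006
      Details:
        # div:nth-of-type(4) -> div[4]
        selector: //div[contains(@class, "video-text")]/div[4]/p
        concat: "\n\n"
      Performers:
        # div:nth-of-type(2) -> div[2]
        Name: //div[contains(@class, "video-text")]/div[2]/a
      Tags:
        # div:nth-of-type(3) -> div[3]
        Name: //div[contains(@class, "video-text")]/div[3]/a
```

Selectors like these stop matching as soon as the site adds or reorders a block, which is why the text-anchored version above is usually the safer choice.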

bnkai avatar Mar 19 '22 11:03 bnkai

Superseded by #1247

Maista6969 avatar Feb 09 '24 23:02 Maista6969