Replace xpath scraper for LifeSelector.com / 21Roles.com with Python scraper
This adds a Python-based scraper for:
- lifeselector.com
- 21roles.com
About the scraper
The scraper combines HTML scraping and REST API calls to get info for the following:
- movie (called "game" on each site, as they are "choose your own adventure" type games where you can make a choice between different scenes)
- scene (each game on the site contains multiple scenes, these are distinguished by the 2nd and subsequent images in the image background carousel on the game page)
- the performers for the entire movie are scraped, the user can add/remove as applicable to the scene
- the tags for the entire movie are scraped, the user can add/remove as applicable to the scene
- the movie synopsis describes the whole multi-scene release, but is used as the scene details; a user can choose to replace this text with the text shown in the scene cover image (see the sketch after this list)
- the individual scene image is scraped!
- gallery (shares common info with the corresponding movie)
- performer
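To make the bullets above concrete, here is a hypothetical sketch of how a per-scene result could be assembled from the movie-level and scene-level data. The function name, dict keys, and title scheme are illustrative assumptions, not the scraper's actual code.

```python
# Hypothetical illustration only: combining movie-level info (performers,
# tags, synopsis) with scene-level info (the per-scene carousel image).
def build_scene(movie: dict, scene_index: int) -> dict:
    # the first carousel image is the movie cover; the 2nd and subsequent
    # images are the individual scene covers (scene_index is 0-based)
    scene_image = movie['carousel_images'][scene_index + 1]
    return {
        'title': f"{movie['title']} - Scene {scene_index + 1}",  # illustrative title scheme
        'details': movie['synopsis'],       # movie synopsis reused as scene details
        'image': scene_image,               # individual scene cover, not the movie cover
        'performers': movie['performers'],  # user removes any not in this scene
        'tags': movie['tags'],              # user removes any not applicable
    }
```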
The sites have the following areas:
- galleries
- movies ("games")
- performers
So the following work:
- galleryByFragment (with either `url` or `title` populated)
- galleryByURL
- movieByURL
- performerByFragment (with either `url` or `name` populated; see the sketch below)
- performerByURL
- sceneByName
- sceneByQueryFragment
There is no such thing as a scene URL on these sites, so sceneByURL and sceneByFragment are absent from the scraper.
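For the *ByFragment actions listed above, a hedged sketch of how the scraper might branch on whichever field stashapp has populated. The fragment keys come from the list above; `_search_performer_by_name` is a hypothetical helper, not necessarily the scraper's real method name.

```python
# Hypothetical sketch of a fragment action, inside the scraper class
# (a BasePythonScraper subclass): dispatch on whichever field stashapp
# populated. _search_performer_by_name is an assumed helper.
def _get_performer_by_fragment(self, fragment: dict) -> dict:
    if fragment.get('url'):
        # a performer URL was supplied, so scrape that page directly
        return self._get_performer_by_url(fragment['url'])
    if fragment.get('name'):
        # otherwise look the performer up by name
        return self._search_performer_by_name(fragment['name'])
    return {}
```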
About `base_python_scraper.py`
I made a `BasePythonScraper` class in `py_common/base_python_scraper.py` that provides the following features:
- loads a nationality,country CSV and stores it as a dict
- provides a nationality to country converter method `_get_country_for_nationality(nationality)`
- provides a date string converter method `_convert_date(date_string, format_in, format_out)` (see the sketch after this list)
- loads the arguments specified in the scraper YAML
- loads the JSON (fragment) sent by stashapp
- maps the action to a class method (that is executed to produce the result)
- calling the string value of the class returns the JSON representation of the result
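As referenced in the list above, here is a minimal sketch of how a derived scraper might use the date helper. It assumes `format_in`/`format_out` are `strptime`/`strftime`-style format strings, and the literal values are made-up examples.

```python
from py_common import base_python_scraper


class ExampleScraper(base_python_scraper.BasePythonScraper):
    '''Hypothetical subclass, only to illustrate the date helper.'''

    def _get_performer_by_url(self, url: str) -> dict:
        # assumption: format_in/format_out are strptime/strftime format strings
        birthdate = self._convert_date('July 4, 1990', '%B %d, %Y', '%Y-%m-%d')
        return {'birthdate': birthdate}
```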
This means a new Python scraper no longer needs to build the stash integration from scratch each time: just create a derived/child/sub class of the base class and override the methods corresponding to the script actions in your scraper YAML.
```yaml
# your_new_scraper.yaml
name: "Your New Scraper"
performerByURL:
  - action: script
    url:
      - domain.com/model/
      - domain2.com/model/
    script:
      - python
      - your_new_scraper.py
      - performerByURL
```
For example, if you make a new Python scraper that implements performerByURL, the code is quite minimal:
```python
'''
your_new_scraper.py
'''
import requests

from py_common import base_python_scraper


class YourNewScraper(base_python_scraper.BasePythonScraper):
    '''
    Implemented script actions and helper functions
    '''

    def _get_performer_by_url(self, url: str) -> dict:
        '''
        Get performer properties by URL sent by stashapp
        '''
        performer = {}

        # do your HTML scraping and/or API calls here, e.g. derive an API
        # endpoint from the page URL (shown here simply reusing the URL)
        api_url = url
        api_result = requests.get(api_url).json()

        performer['name'] = api_result['name']
        performer['country'] = \
            self._get_country_for_nationality(api_result['nationality'])

        return performer


if __name__ == '__main__':
    result = YourNewScraper()
    print(result)
```
All you have to do is build up the properties of the performer dict and return it, and the base class functionality takes care of the input and output handling.
Where the class is instantiated, in `result = YourNewScraper()`, it loads all of the input parameters and processes the requested scraping action, storing the result as a dict in `self.result`. You don't have to access that member variable yourself: converting the class instance to a string returns the JSON representation that stashapp requires as the response, so all you have to do is `print(result)`.
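For context, here is a much-simplified sketch of the pattern described above (read the fragment and action, dispatch to a method, serialise on `str()`). It is an illustration only, not the actual code in `py_common/base_python_scraper.py`, and the stdin/argv handling shown is an assumption.

```python
import json
import sys


class SimplifiedBaseScraper:
    '''Illustrative pattern only; not the real BasePythonScraper.'''

    def __init__(self):
        # stashapp sends the fragment to the script as JSON on stdin, and the
        # action name is the last script argument from the scraper YAML
        self.fragment = json.loads(sys.stdin.read())
        self.action = sys.argv[-1]
        # map each supported action to a method (implemented by a subclass)
        actions = {
            'performerByURL': lambda: self._get_performer_by_url(self.fragment['url']),
        }
        self.result = actions[self.action]()

    def __str__(self) -> str:
        # printing the instance emits the JSON response stashapp expects
        return json.dumps(self.result)
```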
About the Python scraper tests
I created some unit tests in `py_tests`:
- `test_base_python_scraper.py`: a test for `base_python_scraper.py` (more specifically, the `BasePythonScraper` class within it)
- `test_life_selector.py`: a test for `life_selector.py` (more specifically, the `LifeSelectorScraper` class within it)
This was mostly for my own sanity while developing the base class, and may well not have full code coverage, but it is definitely better than nothing and ensures at least some consistency in coding standards / class behaviour.
If the original xPath scraper still works it might be nice to preserve it and add Python as a new scraper similar to how others are handled.
The problem with the current scraper is mainly that sceneByURL is implemented to grab the info for the entire multi-scene release. The sites, lifeselector.com and 21roles.com, call the releases shows or games, but they are basically movies with several scenes, as you can see from the 2nd through last images in the background image carousel.
TL;DR: the current version implements sceneByURL to incorrectly grab the "movie" info, i.e.
- the image is the movie cover image (only the first image in the background image carousel is scraped; the other images in the carousel are the individual scene covers and should be used for each scene. There are/were many entries in stashdb where the movie cover, showing all the performers, was used instead of the scene cover picturing just that scene's performers)
- the scraped title is the movie title, and doesn't distinguish between scenes... this is annoying when there are a few scenes in the movie, and especially if people are scraping the movie cover image for all of them
I can reinstate the current (imo bad/wrong) scraper if you want, and just chastise people on stashdb.org when they submit edits scraped by the xpath scraper, but that would be annoying when people could instead use this scraper and scrape the scenes properly.
It's more than just a refactor; it solves an issue of edit quality/accuracy that exists for these sites because they don't have individual scene URLs, only a movie URL that shows the scenes as images in a background image carousel.
Thank you for explaining. I agree it makes sense to replace it in this case; it wouldn't be the first time StashDB guidelines influenced scrapers. Ultimately it's bnkai's decision.
Imho it is better to replace the original scraper. I will have a look asap.
I'm not sure if you just mean review this PR, or look at creating a replacement scraper. In case you meant the latter, this PR is a working solution for a replacement.
I'm going to close this and open a new PR when I have time to revisit it.