Replace xpath scraper for LifeSelector.com / 21Roles.com with Python scraper
This adds a Python-based scraper for:
- lifeselector.com
- 21roles.com
About the scraper
The scraper combines HTML scraping and REST API calls to get info for the following:
- movie (called "game" on each site, as they are "choose your own adventure" type games where you can make a choice between different scenes)
- scene (each game on the site contains multiple scenes, these are distinguished by the 2nd and subsequent images in the image background carousel on the game page)
- the performers for the entire movie are scraped, the user can add/remove as applicable to the scene
- the tags for the entire movie are scraped, the user can add/remove as applicable to the scene
- the movie synopsis describes the whole multi-scene release, but is used as the scene details; a user can choose to replace this text with the text shown in the scene cover image (see the sketch after this list)
- the individual scene image is scraped!
- gallery (shares common info with the corresponding movie)
- performer
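To make the bullets above concrete, here is a hypothetical sketch of how a per-scene result could be assembled from the movie-level and scene-level data. The function name, dict keys, and title scheme are illustrative assumptions, not the scraper's actual code.

```python
# Hypothetical illustration only: combining movie-level info (performers,
# tags, synopsis) with scene-level info (the per-scene carousel image).
def build_scene(movie: dict, scene_index: int) -> dict:
    # the first carousel image is the movie cover; the 2nd and subsequent
    # images are the individual scene covers (scene_index is 0-based)
    scene_image = movie['carousel_images'][scene_index + 1]
    return {
        'title': f"{movie['title']} - Scene {scene_index + 1}",  # illustrative title scheme
        'details': movie['synopsis'],       # movie synopsis reused as scene details
        'image': scene_image,               # individual scene cover, not the movie cover
        'performers': movie['performers'],  # user removes any not in this scene
        'tags': movie['tags'],              # user removes any not applicable
    }
```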
The sites have the following areas:
- galleries
- movies ("games")
- performers
So the following work:
- galleryByFragment (with either `url` or `title` populated)
- galleryByURL
- movieByURL
- performerByFragment (with either `url` or `name` populated; see the sketch below)
- performerByURL
- sceneByName
- sceneByQueryFragment
There is no such thing as a scene URL on these sites, so sceneByURL and sceneByFragment are absent from the scraper.
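For the *ByFragment actions listed above, a hedged sketch of how the scraper might branch on whichever field stashapp has populated. The fragment keys come from the list above; `_search_performer_by_name` is a hypothetical helper, not necessarily the scraper's real method name.

```python
# Hypothetical sketch of a fragment action, inside the scraper class
# (a BasePythonScraper subclass): dispatch on whichever field stashapp
# populated. _search_performer_by_name is an assumed helper.
def _get_performer_by_fragment(self, fragment: dict) -> dict:
    if fragment.get('url'):
        # a performer URL was supplied, so scrape that page directly
        return self._get_performer_by_url(fragment['url'])
    if fragment.get('name'):
        # otherwise look the performer up by name
        return self._search_performer_by_name(fragment['name'])
    return {}
```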
About `base_python_scraper.py`
I made a `BasePythonScraper` class in `py_common/base_python_scraper.py` that provides the following features:
- loads a nationality,country CSV and stores it as a dict
- provides a nationality to country converter method `_get_country_for_nationality(nationality)`
- provides a date string converter method `_convert_date(date_string, format_in, format_out)` (see the sketch after this list)
- loads the arguments specified in the scraper YAML
- loads the JSON (fragment) sent by stashapp
- maps the action to a class method (that is executed to produce the result)
- calling the string value of the class returns the JSON representation of the result
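As referenced in the list above, here is a minimal sketch of how a derived scraper might use the date helper. It assumes `format_in`/`format_out` are `strptime`/`strftime`-style format strings, and the literal values are made-up examples.

```python
from py_common import base_python_scraper


class ExampleScraper(base_python_scraper.BasePythonScraper):
    '''Hypothetical subclass, only to illustrate the date helper.'''

    def _get_performer_by_url(self, url: str) -> dict:
        # assumption: format_in/format_out are strptime/strftime format strings
        birthdate = self._convert_date('July 4, 1990', '%B %d, %Y', '%Y-%m-%d')
        return {'birthdate': birthdate}
```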
This means a new Python scraper no longer needs to build the stash integration from scratch each time: just create a derived/child/sub class of the base class and override the methods corresponding to the script actions in your scraper YAML.
```yaml
# your_new_scraper.yaml
name: "Your New Scraper"
performerByURL:
  - action: script
    url:
      - domain.com/model/
      - domain2.com/model/
    script:
      - python
      - your_new_scraper.py
      - performerByURL
```
For example, if you make a new Python scraper that implements performerByURL, the code is quite minimal:
```python
'''
your_new_scraper.py
'''
import requests

from py_common import base_python_scraper


class YourNewScraper(base_python_scraper.BasePythonScraper):
    '''
    Implemented script actions and helper functions
    '''

    def _get_performer_by_url(self, url: str) -> dict:
        '''
        Get performer properties by URL sent by stashapp
        '''
        performer = {}

        # do your HTML scraping and/or API calls here, e.g. derive an API
        # endpoint from the page URL (shown here simply reusing the URL)
        api_url = url
        api_result = requests.get(api_url).json()

        performer['name'] = api_result['name']
        performer['country'] = \
            self._get_country_for_nationality(api_result['nationality'])

        return performer


if __name__ == '__main__':
    result = YourNewScraper()
    print(result)
```
All you have to do is build up the properties of the performer dict and return it, and the base class functionality takes care of the input and output handling.
Where the class is instantiated, in `result = YourNewScraper()`, it loads all of the input parameters and processes the requested scraping action, storing the result as a dict in `self.result`. You don't have to access that member variable yourself: converting the class instance to a string returns the JSON representation that stashapp requires as the response, so all you have to do is `print(result)`.
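For context, here is a much-simplified sketch of the pattern described above (read the fragment and action, dispatch to a method, serialise on `str()`). It is an illustration only, not the actual code in `py_common/base_python_scraper.py`, and the stdin/argv handling shown is an assumption.

```python
import json
import sys


class SimplifiedBaseScraper:
    '''Illustrative pattern only; not the real BasePythonScraper.'''

    def __init__(self):
        # stashapp sends the fragment to the script as JSON on stdin, and the
        # action name is the last script argument from the scraper YAML
        self.fragment = json.loads(sys.stdin.read())
        self.action = sys.argv[-1]
        # map each supported action to a method (implemented by a subclass)
        actions = {
            'performerByURL': lambda: self._get_performer_by_url(self.fragment['url']),
        }
        self.result = actions[self.action]()

    def __str__(self) -> str:
        # printing the instance emits the JSON response stashapp expects
        return json.dumps(self.result)
```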
About the Python scraper tests
I created some unit tests in `py_tests`:
- `test_base_python_scraper.py`: a test for `base_python_scraper.py` (more specifically, the `BasePythonScraper` class within it)
- `test_life_selector.py`: a test for `life_selector.py` (more specifically, the `LifeSelectorScraper` class within it)
This was mostly for my own sanity while developing the base class, and may well not have full code coverage, but it is definitely better than nothing and ensures at least some consistency in coding standards / class behaviour.
If the original xPath scraper still works it might be nice to preserve it and add Python as a new scraper similar to how others are handled.
The problem with the current scraper is mainly that sceneByURL is implemented to grab the info for the entire multi-scene release. The sites, lifeselector.com and 21roles.com, call the releases shows or games, but they are basically movies with several scenes, as you can see from the 2nd through last images in the background image carousel.
TL;DR: the current version implements sceneByURL to incorrectly grab the "movie" info, i.e.
- the image is the movie cover image (only the first image in the background image carousel is scraped; the other images in the carousel are the individual scene covers and should be used for each scene. There are/were many entries in stashdb where the movie cover, showing all the performers, was used instead of the scene cover picturing just that scene's performers)
- the scraped title is the movie title, and doesn't distinguish between scenes... this is annoying when there are a few scenes in the movie, and especially if people are scraping the movie cover image for all of them
I can reinstate the current (imo bad/wrong) scraper if you want, and just chastise people on stashdb.org when they submit edits scraped by the xpath scraper, but that would be annoying when people could instead use this scraper and scrape the scenes properly.
It's more than just a refactor; it solves an issue of edit quality/accuracy that exists for these sites because they don't have individual scene URLs, only a movie URL that shows the scenes as images in a background image carousel.
Thank you for explaining. I agree it makes sense to replace it in this case; it wouldn't be the first time StashDB guidelines influenced scrapers. Ultimately it's bnkai's decision.
Imho it is better to replace the original scraper. I will have a look asap.
I'm not sure if you just mean review this PR, or look at creating a replacement scraper. In case you meant the latter, this PR is a working solution for a replacement.
I'm going to close this and open a new PR when I have time to revisit it.