Improve FQDN determination for Algolia
## Problem
Currently the scraper determines the scene URL hostname as domain.com, when it always seems to be www.domain.com.
www.adulttimepilots.com is an example of a site that doesn't have the normal scene pages (or at least not publicly viewable); it has an alternate scene URL format at the same domain.
www.pansexualx.com is another example of a site that does not have normal scene pages (either at all, or not publicly), but its scenes can be viewed at the usual URL format if the domain is changed to the parent network, www.evilangel.com.
## Solution
This change basically adds the www. prefix to the FQDN generated for scene URLs, and also:
- does a DNS lookup on the FQDN to see if the domain resolves to an IP
- if it does, use that
- if it doesn't, DNS lookup the network FQDN instead
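The lookup-with-fallback described above could be sketched like this; the function name and parameters are illustrative, not the actual scraper code:

```python
import socket

def resolve_fqdn(site_fqdn: str, network_fqdn: str) -> str:
    """Return site_fqdn if it resolves in DNS, otherwise fall back
    to the parent network's FQDN (hypothetical helper)."""
    try:
        # Any successful lookup means the site domain exists
        socket.getaddrinfo(site_fqdn, None)
        return site_fqdn
    except socket.gaierror:
        return network_fqdn
```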
A condition was added for www.adulttimepilots.com to use its alternate scene URL format, e.g. https://www.adulttimepilots.com/blowing-her-neighbors-mind/
A condition was added for pansexualx to use www.evilangel.com, e.g. https://www.evilangel.com/en/video/pansexualx/TS-OLIVIA-WOULD-+-Cis-Girl-Jane-Wilde/231910
The existing condition for welikegirls was extended to use www.girlsway.com, e.g. https://www.girlsway.com/en/video/welikegirls/We-Like-Girls---Casey--Kylie/172364
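The three conditions above could be expressed as a small lookup table rather than branching; everything here (dict name, template fields, fallback) is an illustrative sketch, not the actual Algolia.py constants:

```python
# Hypothetical per-site overrides: site slug -> (host, URL template)
SITE_URL_OVERRIDES = {
    "adulttimepilots": ("www.adulttimepilots.com", "https://{host}/{slug}/"),
    "pansexualx": ("www.evilangel.com",
                   "https://{host}/en/video/{site}/{title}/{id}"),
    "welikegirls": ("www.girlsway.com",
                    "https://{host}/en/video/{site}/{title}/{id}"),
}

def scene_url(site: str, slug: str, title: str, scene_id: int) -> str:
    # Default: assume www.<site>.com with the usual URL format
    host, template = SITE_URL_OVERRIDES.get(
        site, (f"www.{site}.com",
               "https://{host}/en/video/{site}/{title}/{id}"))
    return template.format(host=host, site=site, slug=slug,
                           title=title, id=scene_id)
```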
This has a bug: when you try "Devil's Film", the generated URL includes the apostrophe and so is wrong, e.g. www.devil'sfilm.com
Also, I'm not sure why you're checking for a specific SocketKind; it's causing what look like successful lookups to return zero results:
```
addrinfo: [(<AddressFamily.AF_INET: 2>, 0, 0, '', ('172.67.73.252', 0)), (<AddressFamily.AF_INET: 2>, 0, 0, '', ('104.26.12.129', 0)), (<AddressFamily.AF_INET: 2>, 0, 0, '', ('104.26.13.129', 0))]
DNS records: []
Found 0 DNS records for www.girlsway.com
```
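The log output shows why the filter finds nothing: when no socket type is requested, the resolver can return entries with socktype 0, so a post-hoc SocketKind check discards every result. A hedged sketch of one way to avoid that (function name is illustrative):

```python
import socket

def dns_records(fqdn: str) -> list[str]:
    # Requesting SOCK_STREAM up front avoids filtering afterwards:
    # some platforms return addrinfo entries with socktype 0 when no
    # type is given, so comparing against SocketKind.SOCK_STREAM
    # after the fact matches nothing.
    try:
        infos = socket.getaddrinfo(fqdn, None, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # deduplicate the IPs while preserving order
    return list(dict.fromkeys(addr[0] for *_, addr in infos))
```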
Thanks @Maista6969 and @ltgorman for checking/testing this.
I've decided to remove this DNS check and instead rewrite the determine_studio_name_from_json function, along with some or all of the constant lists and dicts near the start of the file.
To help check the logic, I'm going through each Algolia_* scraper, looking at the studio domain or name to match it to one in StashDB, traversing any parent studios, and adding a unit test for each, populated with actual scraped API data.
This should end up easier to read and maintain, and give correct results for each URL/search.
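The per-scraper unit tests described above might look something like this; the simplified stand-in body and the JSON fields shown are illustrative, not real API data or the real function:

```python
def determine_studio_name_from_json(api_scene: dict) -> str:
    # Simplified stand-in for the real function: prefer the serie
    # name, falling back to the network name (hypothetical fields)
    return api_scene.get("serie_name") or api_scene.get("network_name", "")

def test_studio_name_prefers_serie():
    scene = {"serie_name": "We Like Girls", "network_name": "Girlsway"}
    assert determine_studio_name_from_json(scene) == "We Like Girls"

def test_studio_name_falls_back_to_network():
    scene = {"serie_name": None, "network_name": "Evil Angel"}
    assert determine_studio_name_from_json(scene) == "Evil Angel"
```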
I really appreciate your work on this; the Algolia scraper is a bit of a hairy ball that I'd love to refactor in the long run. What are your thoughts on exposing it as a library instead of encoding all of this studio/network logic directly in Algolia.py? That way, sites with quirks around e.g. naming and URL schemes could keep those within their own Python scraper files, while the core logic could still live in the main scraper.
Not something we need to solve today, this is already a pretty big undertaking 👍
I think that's a really good idea: have a core Algolia.py and externalise each studio/network's conventions on how to pick out the studio name, FQDN, and URL format. That could maybe be possible via a newly formed configuration options object in the scraper YAML, or more likely via algolia_<studio/network>.py files that extend a base class in Algolia.py with override methods for the studio/FQDN/URL determination.
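A minimal sketch of that base-class idea; all class names, method names, and JSON fields here are hypothetical, just to show the override shape:

```python
class AlgoliaScraper:
    """Hypothetical base class living in Algolia.py."""

    def studio_name(self, api_scene: dict) -> str:
        return api_scene.get("network_name", "")

    def fqdn(self, site: str) -> str:
        return f"www.{site}.com"

    def scene_url(self, site: str, api_scene: dict) -> str:
        return (f"https://{self.fqdn(site)}/en/video/{site}/"
                f"{api_scene['url_title']}/{api_scene['clip_id']}")

class PansexualxScraper(AlgoliaScraper):
    """Hypothetical algolia_pansexualx.py: the site has no public
    scene pages, so serve URLs from the parent network instead."""

    def fqdn(self, site: str) -> str:
        return "www.evilangel.com"
```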
I'm going to close this and open a new PR when I have time to revisit this