CommunityScrapers

improve FQDN determining for Algolia

Open nrg101 opened this issue 2 years ago • 5 comments

Problem

Currently the scraper determines the scene URL hostname as domain.com, when it always seems to be www.domain.com.

www.adulttimepilots.com is an example of a site that doesn't have the normal scene pages, or at least not publicly viewable ones. Instead, it has an alternate scene URL format at the same domain.

www.pansexualx.com is another example of a site that does not have normal scene pages (either at all or publicly), but the scenes can be viewed with the usual URL format if the domain is changed to the parent network www.evilangel.com

Solution

This change adds the www. prefix to the FQDN generated for scene URLs, and also:

  • does a DNS lookup on the FQDN to see if the domain resolves to an IP
    • if it does, use that
    • if it doesn't, DNS lookup the network FQDN instead
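
The lookup-and-fallback steps above could be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `resolves`, `pick_scene_fqdn` and the injectable `lookup` parameter are hypothetical names, and the sketch simply falls back to the network host without checking that it resolves too.

```python
import socket
from typing import Callable


def resolves(fqdn: str) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(fqdn, None)) > 0
    except (socket.gaierror, UnicodeError):
        # gaierror: name did not resolve; UnicodeError: invalid hostname label
        return False


def pick_scene_fqdn(
    site_domain: str,
    network_domain: str,
    lookup: Callable[[str], bool] = resolves,
) -> str:
    """Prefer www.<site>; fall back to www.<network> if the site does not resolve."""
    site_fqdn = f"www.{site_domain}"
    if lookup(site_fqdn):
        return site_fqdn
    return f"www.{network_domain}"
```

Passing `lookup` as a parameter keeps the decision logic testable without network access, e.g. `pick_scene_fqdn("example.com", "example.org", lookup=lambda host: False)`.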

Condition added for www.adulttimepilots.com to use alternate scene URL, e.g. https://www.adulttimepilots.com/blowing-her-neighbors-mind/

Condition added for pansexualx to use www.evilangel.com, e.g. https://www.evilangel.com/en/video/pansexualx/TS-OLIVIA-WOULD-+-Cis-Girl-Jane-Wilde/231910

Condition extended for welikegirls to use www.girlsway.com, e.g. https://www.girlsway.com/en/video/welikegirls/We-Like-Girls---Casey--Kylie/172364
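
For illustration, the three conditions above could be expressed as a small lookup table plus one special case. This is only a sketch: `NETWORK_HOST_OVERRIDES`, `scene_url`, `url_title` and `clip_id` are hypothetical names, and the default of `www.<site>.com` is an assumption rather than what Algolia.py actually does.

```python
# Sites whose scenes are served from the parent network's host
# (taken from the examples in this thread).
NETWORK_HOST_OVERRIDES = {
    "pansexualx": "www.evilangel.com",
    "welikegirls": "www.girlsway.com",
}


def scene_url(site: str, url_title: str, clip_id: int) -> str:
    """Build a scene URL, applying the per-site exceptions described above."""
    if site == "adulttimepilots":
        # Alternate URL format at the same domain: no /en/video/ path, no ID
        return f"https://www.adulttimepilots.com/{url_title}/"
    # Assumption: sites without an override live at www.<site>.com
    host = NETWORK_HOST_OVERRIDES.get(site, f"www.{site}.com")
    return f"https://{host}/en/video/{site}/{url_title}/{clip_id}"
```

With this sketch, `scene_url("welikegirls", "We-Like-Girls---Casey--Kylie", 172364)` produces the www.girlsway.com URL shown above.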

nrg101 avatar Oct 05 '23 17:10 nrg101

This has a bug: when you try "Devil's Film", the generated hostname includes the apostrophe and so is wrong, e.g. www.devil'sfilm.com
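
One possible fix, sketched under the assumption that the hostname is derived by lowercasing the studio name and appending .com (`studio_to_hostname` is a hypothetical helper, not a function from the PR), is to drop every character that is not valid in a hostname label:

```python
import re


def studio_to_hostname(studio_name: str) -> str:
    """Lowercase the name and keep only letters, digits and hyphens."""
    label = re.sub(r"[^a-z0-9-]", "", studio_name.lower())
    return f"www.{label}.com"


print(studio_to_hostname("Devil's Film"))  # www.devilsfilm.com
```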

ltgorman avatar Oct 07 '23 02:10 ltgorman

Also not sure why you're checking for a specific SocketKind; it's causing what look like successful lookups to return zero results:

addrinfo: [(<AddressFamily.AF_INET: 2>, 0, 0, '', ('172.67.73.252', 0)), (<AddressFamily.AF_INET: 2>, 0, 0, '', ('104.26.12.129', 0)), (<AddressFamily.AF_INET: 2>, 0, 0, '', ('104.26.13.129', 0))]
DNS records: []
Found 0 DNS records for www.girlsway.com
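
One reading of the log above is that every tuple has a socket type of 0, so a filter on a specific SocketKind matches nothing. A minimal sketch of collecting the addresses without that filter, using the tuples from the log as sample data (`ip_addresses` is a hypothetical helper, and the actual filtering code in the PR is not shown here):

```python
import socket


def ip_addresses(addrinfo):
    """Collect unique IPs from getaddrinfo-style tuples, ignoring the socket type."""
    seen = []
    for family, _socktype, _proto, _canonname, sockaddr in addrinfo:
        if family in (socket.AF_INET, socket.AF_INET6) and sockaddr[0] not in seen:
            seen.append(sockaddr[0])
    return seen


# Sample data copied from the log output above
sample = [
    (socket.AF_INET, 0, 0, "", ("172.67.73.252", 0)),
    (socket.AF_INET, 0, 0, "", ("104.26.12.129", 0)),
    (socket.AF_INET, 0, 0, "", ("104.26.13.129", 0)),
]

# Filtering on a specific SocketKind drops every entry, since socktype is 0
stream_only = [entry for entry in sample if entry[1] == socket.SOCK_STREAM]

print(ip_addresses(sample))
print(stream_only)
```

If a single socket type is really wanted, `socket.getaddrinfo` also accepts a `type` argument (e.g. `type=socket.SOCK_STREAM`), which asks the resolver for that type up front instead of filtering afterwards.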

Maista6969 avatar Oct 07 '23 03:10 Maista6969

Thanks @Maista6969 and @ltgorman for checking/testing this.

I've decided to remove this DNS check and instead rewrite the determine_studio_name_from_json function, and some/all of the constant lists and dicts near the start of the file.

To help check the logic, I'm going through each Algolia_* scraper, matching its studio domain or name to one in StashDB, traversing any parent studios, and adding a unit test for each, populated with actual scraped API data.

This should end up easier to read and maintain, and give correct results for each URL/search.

nrg101 avatar Oct 10 '23 16:10 nrg101

I really appreciate your work on this. The Algolia scraper is a bit of a hairy ball that I'd love to refactor in the long run. What are your thoughts on exposing it as a library instead of encoding all of this studio/network logic directly in Algolia.py? That way, sites that have quirks around e.g. naming and URL schemes could keep those within their own Python scraper files, while the core logic could still live in the main scraper.

Not something we need to solve today, this is already a pretty big undertaking 👍

Maista6969 avatar Oct 12 '23 20:10 Maista6969

I think that's a really good idea: have a core Algolia.py and externalise each studio/network's conventions on how to pick out the studio name, FQDN and URL format. That might be possible via a new configuration options object in the scraper YAML, or more likely just algolia_<studio/network>.py files that extend a base class in Algolia.py with override methods for the studio/FQDN/URL determination.
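
As a rough sketch of the base-class idea (all class and method names here are hypothetical, the `www.<site>.com` default is an assumption, and none of this exists in Algolia.py today):

```python
class AlgoliaSite:
    """Hypothetical base class that would live in Algolia.py."""

    site = ""

    def fqdn(self) -> str:
        # Assumed default: the site's own www host
        return f"www.{self.site}.com"

    def scene_url(self, url_title: str, clip_id: int) -> str:
        # Usual URL format seen in the examples in this thread
        return f"https://{self.fqdn()}/en/video/{self.site}/{url_title}/{clip_id}"


class PansexualX(AlgoliaSite):
    """Would live in its own algolia_<studio/network>.py file."""

    site = "pansexualx"

    def fqdn(self) -> str:
        # Scenes are viewable on the parent network's host
        return "www.evilangel.com"


class AdultTimePilots(AlgoliaSite):
    site = "adulttimepilots"

    def scene_url(self, url_title: str, clip_id: int) -> str:
        # Alternate scene URL format at the same domain
        return f"https://{self.fqdn()}/{url_title}/"
```

Each quirk then sits in one small subclass, while the shared URL logic stays in the base class.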

nrg101 avatar Oct 13 '23 09:10 nrg101

I'm going to close this and open a new PR when I have time to revisit this

nrg101 avatar Sep 09 '24 13:09 nrg101