
FEATURE: Implement check the availability of website / detection of parked & for sale domains

Open keczuppp opened this issue 2 years ago • 10 comments

Description

I had been thinking about this even before I found this: https://github.com/StevenBlack/hosts/issues/1613#issuecomment-820162550 @AdKiller

Just as there are many dead (non-existent etc.) domains, there are many parked / for sale zombie domains, which still exist but are dead at the same time: "This domain is for sale", "Website is no longer available". Currently PyFunceble marks such domains as ACTIVE, but it could have an option to check whether a real website exists on the domain and label it "REAL" if so, or "ZOMBIE" if not.

Possible Solution

Such feature:

  • would require downloading the body of the domain's main page and searching it for text phrases like: "for sale", "no longer available", "rent domain" etc.,
  • could lead to some false positives,
  • it's a pity there is no HTTP status code for this (https://en.wikipedia.org/wiki/List_of_HTTP_status_codes); with one, it would not be necessary to download the whole body and search it for text phrases.
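
A minimal sketch of the phrase search described above, assuming the page body has already been downloaded as a string. The function name `find_parked_phrases` and the constant `PARKED_PHRASES` are hypothetical and not part of PyFunceble; the three phrases are the ones quoted in this issue.

```python
import re

# Phrases quoted in this issue; a real list would need to be much longer.
PARKED_PHRASES = ("for sale", "no longer available", "rent domain")


def find_parked_phrases(body, phrases=PARKED_PHRASES):
    """Return the phrases found in the body, ignoring case and repeated whitespace."""
    normalized = re.sub(r"\s+", " ", body).lower()
    return [phrase for phrase in phrases if phrase in normalized]


html = "<html><body><h1>This domain is FOR   SALE</h1></body></html>"
print(find_parked_phrases(html))
```

A plain substring match like this is exactly where the false positives mentioned above would come from: a legitimate shop page can contain "for sale" too, so a match is a hint rather than proof.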

Screenshot

example

keczuppp avatar Oct 04 '21 11:10 keczuppp

The trick is to set up your own DNS recursor and then configure it with the RPZ zone pirated.mypdns.cloud

The reason is that you will then be blocking through the .rpz-nsdname. The next thing you have to do is disable the HTTP status code and WHOIS checks and rely purely on the DNS test.

You can now do 2 things with the results file domain/INACTIVE/list

  1. Add them as pirated domains and block them
  2. Use a removal tool like sed and remove them from your source
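
As a rough sketch of the second option in Python rather than sed: the function below drops every entry of an inactive list from a source list. It assumes both files are plain lists with one domain per line and `#` comments, which is an assumption about the file layout, and `remove_inactive` is a hypothetical helper name.

```python
def remove_inactive(source_lines, inactive_lines):
    """Return source_lines without the domains listed in inactive_lines."""
    inactive = {
        line.strip().lower()
        for line in inactive_lines
        if line.strip() and not line.lstrip().startswith("#")
    }
    # Comments and blank lines in the source are kept; only exact matches are dropped.
    return [line for line in source_lines if line.strip().lower() not in inactive]


source = ["# my blocklist", "ads.example.net", "dead.example.org", "tracker.example.com"]
inactive = ["# generated list", "dead.example.org", ""]
cleaned = remove_inactive(source, inactive)
print(cleaned)
```

In practice the two line lists would be read from the source file and from domain/INACTIVE/list.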

This is a much safer approach than trying to keep an up-to-date set of rules; it will also add the IP to the bot lists.

Give it a spin, there are some getting-started configs for PowerDNS Recursor here: https://mypdns.org/rpz/dns-rpz-integration/-/tree/master/PowerDNS-Recursor

If you are having trouble getting it to work, please do open an issue or a discussion

spirillen avatar Oct 04 '21 19:10 spirillen

The idea is good @keczuppp.

My problem right now is that the same webpage as in the screenshot gives me the text in German (IP-based). Let's keep this open until I find a way to implement it - somehow ...

funilrys avatar Oct 12 '21 17:10 funilrys

spirillen : https://github.com/funilrys/PyFunceble/issues/255#issuecomment-933797906

I don't know much about these things, I would have to study all that stuff first

funilrys : https://github.com/funilrys/PyFunceble/issues/255#issuecomment-941221455: text in German (IP-based).

  • in this case it's not based purely on IP, but primarily on the language of the browser. I have not checked yet whether the website falls back to IP-based localization when the body is fetched without a browser. If it does, that complicates searching for text phrases, as there are just too many languages. Still, the phrase search method will work on many other sites which don't provide localized messages, so we should not give up on it
  • another problem is that in this case the example phrase "for sale", regardless of the site's language, is not even present in the raw source code of the page. It is generated from a JavaScript file when the browser loads and parses the page, so searching the source code for "for sale" won't work here (it returns 0 results). It will of course still work on many other parked sites, so we should not give up on the text search method
  • we should also search the body for other unique text identifiers / links, as most parked domains have some unique links to the main parking server, for example: https://publicwww.com/websites/parking.bodiscdn.com/ . Going through the results, all are parked zombies, no functional websites, almost 10 000 of them. A list of such parking links could be created; if a parking link is present in the body's source, the domain is parked / for sale
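
One way to probe the language question from the first point is to fetch the body with an explicit Accept-Language header. The sketch below uses only the Python standard library; `build_request` and `fetch_body` are hypothetical helper names and the User-Agent string is made up. Whether a given parking page honours the header or falls back to IP-based localization is untested here, and this does nothing for text that is only generated by JavaScript.

```python
import urllib.request


def build_request(url, language="en"):
    """Build a GET request that asks the server for content in the given language."""
    headers = {
        "Accept-Language": language,
        "User-Agent": "parked-check-sketch/0.1",  # made-up value for illustration
    }
    return urllib.request.Request(url, headers=headers)


def fetch_body(url, language="en", timeout=10.0):
    """Download the page and return its body as text."""
    with urllib.request.urlopen(build_request(url, language), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_body("http://example.com/")` performs the actual download; the returned text can then be searched for phrases or parking links.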

keczuppp avatar Oct 12 '21 19:10 keczuppp

Another idea, from jawz101 : https://github.com/easylist/easylist/issues/2374#issuecomment-946087387 :

OP here- I just want to say this is very impressive work.

@ ryanbr @ felix-22

I have a suggestion and I may propose it to @ funilrys for the PyFunceble utility. I think I have done so before. For the past couple of years I've been using the Cisco Umbrella (formerly OpenDNS) Top 1 Million daily DNS lookup reports they publish here to evaluate the adaway list.

If you are unfamiliar with OpenDNS, it is a public DNS service which has been around for longer than most other public resolvers which allowed for content filtering and malware/phishing protection. They make lists of the top 1 million name lookups records publicly available. "The OpenDNS Global Network processes an estimated 100 billion DNS queries daily from 85 million users through 25 data centers worldwide."

So, regardless of if a domain is registered- these are the actual queries made by us in circulation. If a domain is parked, it's going to be valid but nothing is pointing to it so you'll never see it used. I will download, say, the past 3 months of logs and if no one has tried to lookup a domain, I pull it from the adaway list.

edit- crazy. I just did it with 2 days of top 1 million files and 6,999 of the 25,556 were on the top 1 million lists.
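
The comparison described in the quote can be sketched as follows. It assumes each daily file is a CSV with `rank,domain` rows, which is an assumption about the file layout, and it uses inline sample data instead of real downloads. The names `load_seen_domains` and `split_by_popularity` are hypothetical.

```python
import csv
import io


def load_seen_domains(csv_texts):
    """Collect every domain that appears in any of the daily top lists."""
    seen = set()
    for text in csv_texts:
        for row in csv.reader(io.StringIO(text)):
            if len(row) >= 2:  # skip empty or malformed rows
                seen.add(row[1].strip().lower())
    return seen


def split_by_popularity(blocklist, seen):
    """Split a blocklist into domains that were queried and domains that never were."""
    queried = [d for d in blocklist if d.lower() in seen]
    never_queried = [d for d in blocklist if d.lower() not in seen]
    return queried, never_queried


day1 = "1,example.com\n2,ads.example.net\n"
day2 = "1,example.com\n2,tracker.example.org\n"
seen = load_seen_domains([day1, day2])
queried, never_queried = split_by_popularity(["ads.example.net", "parked.example"], seen)
print(queried, never_queried)
```

Domains that end up in `never_queried` over a long enough window are the candidates for removal in the approach quoted above.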

keczuppp avatar Oct 19 '21 10:10 keczuppp

@keczuppp wrote in https://github.com/funilrys/PyFunceble/issues/255#issuecomment-946580250

Another idea, from jawz101 : easylist/easylist#2374 (comment) :

Touching https://github.com/funilrys/PyFunceble/issues/128

@keczuppp wrote in https://github.com/funilrys/PyFunceble/issues/255#issuecomment-941369292

I don't know much about these things, I would have to study all that stuff first

You should do that, as it will improve your system's performance significantly: you should read Performance test of Hosts file vs DNS-Recursors :wink:

spirillen avatar Oct 19 '21 10:10 spirillen

@keczuppp an early version is available in the parked-subject branch. However, I'm not sure if it is necessary to create 2 new statuses: REAL and ZOMBIE (or similar) ... With that commit, the tested subject will be treated as INACTIVE.

Does that fit everyone's needs? If it does, I will proceed with merging the branch to the dev version of PyFunceble.

Asking for inputs: @spirillen @mitchellkrogza @ZeroDot1 and others

funilrys avatar Oct 08 '22 13:10 funilrys

@funilrys

A few thoughts :thought_balloon:

  1. To determine pirated domains you should follow https://mypdns.org/infrastructure/dante-commit-bot/-/issues/15, which is where we are building the safe list for marking domains as parked/hijacked/sharked
    1. Why?: a high number of these "parked" domains are used for phishing
  2. You might consider including our pirated project lists to enhance the positive hit lists
  3. (Haven't checked the --help) But there should be a switch for (en|dis)abling this feature and/or maybe even for adding your own source of known pirated domains (it should be a very trustworthy source, as some of these domains actually do get sold and reactivated)

I might get back with more once I have tested it

spirillen avatar Oct 10 '22 12:10 spirillen

Feature has been disabled because I need to gather more intel on how people will use this:

  • As a new test (like syntax, availability, reputation) option.
  • As a SPECIAL rule.

funilrys avatar Nov 26 '22 18:11 funilrys

Personally I would say the --pirated option is best, as it allows individuals to choose for themselves and yet still allows them to use the --special-lookup

The more things are user-optional, the better your modular approach is accomplished.

spirillen avatar Nov 28 '22 13:11 spirillen

Pardon for not writing back sooner, but I have not been very active on github lately, I only noticed yesterday.

funilrys: an early version is available in the parked-subject branch.

Cool.

funilrys: However, I'm not sure if it is necessary to create 2 new statuses: REAL and ZOMBIE (or similar) ... With that commit, the tested subject will be treated as INACTIVE.

I'm not sure either. I was merely speculating whether it could be useful for users to distinguish between the reasons a domain is inactive, but I have no idea whether such a distinction is useful from a practical point of view. If it is not, we can stick to marking an inactive domain simply as inactive, regardless of the reason.

funilrys: Asking for inputs: ... and others

As for the phrases to look for ( LINK 1 ), they seem good; however, it seems this one is worth including as well: ".com is for sale" ( LINK 2 )

Also, like I mentioned before ( LINK 3 ), we should search the raw body not only for typical word phrases, but also for other text values because:

  • some parked domains don't embed the text message about the domain being for sale natively in the raw body; instead they keep it in JS scripts and generate it during page loading, which is out of scope for the tool, because it's not an internet browser
  • even worse, some parked domains don't provide any kind of text message at all about the domain being parked
  • a random example of a parked domain which can't be found by common phrases is: av4.xyz - this parked domain can be identified as parked only by some of the other values like:
    • unique DOM element class name: .comp-is-parked or .sale_link
    • unique JS script variable name: tcblock
    • unique external JS script name: "maincaf.js"
    • unique link: parkingcrew.net/assets
    • unique domain name (SLD): d38psrni17bvxu
  • below is a table of identifiers found on millions of parked domains in PublicWWW; these domains can be identified additionally, or sometimes only, by values other than typical word phrases, so it's worth considering adding these values to the search list:

| Value of Identifier | Type of Identifier | Current amount of domains (31.12.2022) | Previous amount of domains (02.12.2021) | Change |
| --- | --- | --- | --- | --- |
| tcblock | JS variable name | over 1 000 000 | over 1 000 000 | unknown |
| js3caf | JS script partial name | over 1 000 000 | over 1 000 000 | unknown |
| d1lxhc4jvstzrp | Domain name (SLD) | over 1 000 000 | over 1 000 000 | unknown |
| d38psrni17bvxu | Domain name (SLD) | over 1 000 000 | - | unknown |
| "maincaf.js" | JS Script name | 937 500 | - | unknown |
| "for_sale_lander.css" | Stylesheet filename | 720 522 | over 1 000 000 | noticeable decrease |
| "LANDER_SYSTEM" | JS script object name | over 1 000 000 | over 1 000 000 | unknown |
| "img1.wsimg.com/parking-lander/static" | JS Script Src Chunk Link | over 1 000 000 | over 1 000 000 | unknown |
| "img.sedoparking.com" | Image/Banner Link | over 1 000 000 | 937 337 | unknown increase |
| "i.cdnpark.com/themes/registrar/images/logo_namecheap.png" | Image/Banner Link | 511 955 | 285 793 | double increase |
| "traffic.club" | Domain name | 188 | 283 014 | died |
| "Account Suspended" | DOM Title name | 213 682 | 198 156 | small increase |
| "framework.syrahost.com/dist/crazydomains/parked.css?" | Stylesheet filepath + filename | 292 426 | 165 689 | double increase |
| "cdn-staging.domainmarket.com/static-landers/assets/js/main.js" | JS script filepath + filename | 287 | 161 552 | died |
| "domainparking.ru/privacy-policy" | Link to Privacy Policy | 124 678 | 133 541 | small decrease |
| "brokerage.domainbrokers.se/?domain=" | Domain name + request | 6 567 | 83 703 | died |
| "ewebdevelopment.com/quotes/inquire/" | Link | 80 912 | 80 013 | not changed |
| "parking.bodiscdn.com" | Domain name | 795 081 | 55 314 | extreme increase ( x14 ) |
| "shop.ename.com" | Domain Shop name | 32 884 | 49 518 | noticeable decrease |
| "d1s9zexeqsmc0t" | Domain name (SLD) | 582 | 34 169 | died |
| "Start Domain For Sale Box" | Word Phrase | 612 | 27 986 | died |
| "/porkbun.com/checkout/addCartItems?items[marketplace]=" | Domain name + request | 12 110 | 13 056 | small decrease |
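
A minimal sketch of how identifiers like these could be checked against a raw body, assuming a simple case-sensitive substring search. The mapping `PARKING_IDENTIFIERS` holds a handful of values from the table, `match_identifiers` is a hypothetical name, and the sample body is made up for illustration.

```python
# A few identifiers from the table above, mapped to their type.
PARKING_IDENTIFIERS = {
    "tcblock": "JS variable name",
    "maincaf.js": "JS script name",
    "for_sale_lander.css": "Stylesheet filename",
    "img.sedoparking.com": "Image/banner link",
    "parking.bodiscdn.com": "Domain name",
}


def match_identifiers(body, identifiers=PARKING_IDENTIFIERS):
    """Return the identifiers (with their type) that occur in the raw body."""
    return {value: kind for value, kind in identifiers.items() if value in body}


sample_body = (
    '<script src="//cdn.example.invalid/maincaf.js"></script>'
    "<script>var tcblock = {};</script>"
)
print(match_identifiers(sample_body))
```

Returning the matched identifiers instead of a bare yes/no makes it easier to see which signature fired. Short values such as tcblock are more likely to collide with unrelated pages than long links are, so they may deserve extra caution.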

keczuppp avatar Jan 01 '23 14:01 keczuppp