notfoundbot

False results for domains

Open tmcw opened this issue 4 years ago • 8 comments

  • https://www.930.com/
  • https://jade-lang.com/

These got updated to archive URLs, but both sites are online. Need to figure out why.

tmcw avatar Jan 05 '21 16:01 tmcw

Okay, so observations so far:

The 930 club is using a "Sucuri Cloud Proxy" that seems to identify notfoundbot's requests as a DDoS. I've tried the basics to figure out what is informing that silly proxy that it's a bot, but haven't found anything clear so far: curl works, even if I disable all of curl's default headers.

For https://jade-lang.com/ the issue is the SSL certificate, which works for Firefox and Chrome, but not for node. Options I see so far are either disabling strict SSL checks entirely, or loading up a wider set of SSL root certs using something like https://github.com/arvind-agarwal/node_extra_ca_certs_mozilla_bundle
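The second option, widening the trust store instead of switching verification off, might look roughly like the sketch below. It is written in Python purely for illustration, since notfoundbot itself runs on node. `build_tls_context` and `extra_ca_file` are hypothetical names, and the bundle would be whatever PEM file of extra root certs gets shipped alongside the checker.

```python
import ssl
from typing import Optional


def build_tls_context(extra_ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Return a verifying TLS context, optionally trusting extra root certs.

    `extra_ca_file` is a hypothetical path to a PEM bundle of additional CAs.
    """
    # Start from the platform's default trust store, with verification on.
    context = ssl.create_default_context()
    if extra_ca_file is not None:
        # Add the certificates from the bundle on top of the defaults.
        context.load_verify_locations(cafile=extra_ca_file)
    return context
```

Certificate and hostname checks stay enabled either way, so a site with a genuinely broken certificate would still fail.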

tmcw avatar Jan 08 '21 22:01 tmcw

I've got a few more that appear dead but aren't. https://github.com/agrc/gis.utah.gov/pull/1630/files

  • http://www.exploreutah.com/GettingAround/Navigating_Utahs_Streets.shtml
  • https://ugic.org/

curl works for both of these URLs ¯\_(ツ)_/¯

steveoh avatar Jan 19 '21 16:01 steveoh

Adding some more to the list:

https://dc.gov/: something is very weird with the SSL configuration on this one. The first time I curl it, I get:

➜  ~ curl https://dc.gov/
curl: (35) LibreSSL SSL_connect: SSL_ERROR_SYSCALL in connection to dc.gov:443

tmcw avatar Jan 19 '21 18:01 tmcw

Thoughts on creating an exceptions list for the repeat offender links that aren't rotten?

steveoh avatar Feb 05 '21 05:02 steveoh

Yep, exactly, I think that's a great idea.
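A minimal sketch of what such an exceptions list could look like, assuming links are filtered by hostname before they are checked. The names `KNOWN_ALIVE_HOSTS`, `is_excepted`, and `links_to_check` are hypothetical, and the hosts are just ones reported in this thread:

```python
from urllib.parse import urlsplit

# Hypothetical exceptions list: hosts reported in this thread as alive
# even though the bot's requests to them fail.
KNOWN_ALIVE_HOSTS = {"www.930.com", "dc.gov", "ugic.org"}


def is_excepted(url: str, hosts=KNOWN_ALIVE_HOSTS) -> bool:
    """Return True if the URL's host is on the exceptions list."""
    # urlsplit().hostname is already lowercased; it is None for bad URLs.
    host = urlsplit(url).hostname or ""
    return host in hosts


def links_to_check(urls):
    """Drop excepted links so they are never rewritten to archive URLs."""
    return [u for u in urls if not is_excepted(u)]
```

Matching on the exact hostname keeps the list short, but it also means `930.com` and `www.930.com` would need separate entries.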

tmcw avatar Feb 07 '21 17:02 tmcw

I keep getting this false positive: https://github.com/cmudig/cmudig.github.io/pull/50. Maybe it's related.

domoritz avatar Feb 11 '21 09:02 domoritz

Yeah, trying that with curl:

$ curl https://athletics.cmu.edu/athletics/mascot/index
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML><HEAD><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<TITLE>ERROR: The request could not be satisfied</TITLE>
</HEAD><BODY>
<H1>403 ERROR</H1>
<H2>The request could not be satisfied.</H2>
<HR noshade size="1px">
Request blocked.
We can't connect to the server for this app or website at this time. There might be too much traffic or a configuration error. Try again later, or contact the app or website owner.
<BR clear="all">
If you provide content to customers through CloudFront, you can find steps to troubleshoot and help prevent this error by reviewing the CloudFront documentation.
<BR clear="all">
<HR noshade size="1px">
<PRE>
Generated by cloudfront (CloudFront)
Request ID: cBXZnut_EYl-4AVwmNzjF7Qkx9nmy3Z_bXdkIVDiwgxRsTAE_r1YxQ==
</PRE>
<ADDRESS>
</ADDRESS>
</BODY></HTML>

tmcw avatar Feb 11 '21 15:02 tmcw

CloudFront must be blocking requests with UA strings like curl/version? Any UA string that contains the word "curl" fails, but the following works:

curl -A "do not mention the c word" https://athletics.cmu.edu/athletics/mascot/index

Using the UA string that you use in ~linkrot~ notfoundbot works just fine.
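Since a 403 like the one above means "blocked" rather than "gone", one option is to classify responses before deciding a link is rotten. This is only a sketch under assumptions: `classify_response` and its labels are hypothetical, and the choice of status codes is a guess at a sensible policy, not what notfoundbot does today.

```python
def classify_response(status: int, body: str = "") -> str:
    """Label a response as 'alive', 'dead', 'blocked', or 'unknown'."""
    if 200 <= status < 400:
        return "alive"
    if status in (404, 410):
        # Only explicit "not found" / "gone" counts as link rot.
        return "dead"
    if status in (401, 403, 429) or "generated by cloudfront" in body.lower():
        # Refused or rate-limited by a proxy or CDN: the site may be fine.
        return "blocked"
    return "unknown"
```

Only links labeled "dead" would then be candidates for an archive URL.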

steveoh avatar Feb 11 '21 17:02 steveoh