Dup detection with www.domain.com doesn't check for domain.com and vice-versa

Open cointastical opened this issue 3 years ago • 1 comments

If a website is accessible both with and without the www. subdomain, SN could reduce the chance of different users posting the same content by accounting for the www. subdomain in its dup detection.

For example, the following two URLs serve the same content, so the first is a duplicate of the second:

https://podpage.com/citadeldispatch/cd68-a-sunday-afternoon-chat-with-clarkmoody/
https://www.podpage.com/citadeldispatch/cd68-a-sunday-afternoon-chat-with-clarkmoody/

But posting either one when the other already exists on SN does not currently cause SN to show the duplicate warning.

In the above example, the first URL actually gets an HTTP 301 redirect to the www. subdomain. So following HTTP redirects might be one way of detecting a possible duplicate. But some web servers are (poorly) configured to serve either host without any redirect, i.e., www.domain.com serves the same content as domain.com.

And this can happen with other subdomains as well. For instance, a URL might redirect to a language-specific subdomain, as Wikipedia does:

https://wikipedia.org/wiki/Bitcoin
https://en.wikipedia.org/wiki/Bitcoin

That's a particularly gnarly one. The first gets an HTTP 301 redirect that adds the www. subdomain, and the request for that URL then gets another HTTP 301 redirect in which the language (en.) subdomain replaces the www. subdomain (i.e., en.wikipedia.org/... ).
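
A rough sketch of the redirect-following idea, in Python purely for illustration (SN's own code would look different). The names `follow_redirects` and `head_location` are hypothetical, and the 5-hop limit and 5-second timeout are arbitrary assumptions:

```python
import http.client
from typing import Callable, Optional
from urllib.parse import urljoin, urlsplit


def head_location(url: str) -> Optional[str]:
    """Send a HEAD request and return the Location header if the reply is a redirect."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=5)  # timeout value is an assumption
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        if resp.status in (301, 302, 307, 308):
            return resp.getheader("Location")
        return None
    finally:
        conn.close()


def follow_redirects(url: str,
                     get_location: Callable[[str], Optional[str]] = head_location,
                     max_hops: int = 5) -> str:
    """Follow up to max_hops redirects and return the last URL reached."""
    seen = {url}
    for _ in range(max_hops):
        location = get_location(url)
        if location is None:
            break
        next_url = urljoin(url, location)  # Location may be a relative reference
        if next_url in seen:  # stop on a redirect loop
            break
        seen.add(next_url)
        url = next_url
    return url
```

Two URLs that end at the same final URL are likely duplicates. The Wikipedia case needs two hops, which is why the loop keeps going instead of stopping after the first redirect. This does nothing for servers that answer on both hosts without redirecting.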

So without doing further research and analysis, I can't say offhand what the best solution is. I do think simply stripping any www. prefix before doing dup detection (and indexing the URLs of existing posts without the www. and using that for the dup detection) would take care of nearly every instance I can envision.
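
A minimal sketch of the www.-stripping idea, in Python purely for illustration. The function name `dedup_key` is hypothetical, and dropping any user:password part of the URL is my own simplification:

```python
from urllib.parse import urlsplit, urlunsplit


def dedup_key(url: str) -> str:
    """Return the URL with any leading 'www.' removed from the host, for comparison only."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    # Keep an explicit port if present; any user:password@ part is dropped (assumption).
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

Storing this key alongside each existing post and comparing it against the key of a new submission would flag the two podpage.com URLs above as the same, while leaving en.wikipedia.org and wikipedia.org distinct.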

Jun 28 '22 00:06 cointastical

At target state, you probably want to either crawl the actual content and compare that, or fetch/buy/maintain DNS records to see what's what.

Way before that, I would:

1. Find a library to pull a fully-qualified domain name from the URL. Google suggests parse-domain or tldts.
2. Use the registered domain in the business logic of detection, and ignore the subdomain completely.
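
To make the second step concrete, here is a rough Python sketch. It does not use parse-domain or tldts; instead it uses a tiny hard-coded suffix set as a stand-in for the Public Suffix List that such libraries rely on, so it is not correct for real-world use. The names `registered_domain`, `site_key`, and `MULTI_LABEL_SUFFIXES` are hypothetical:

```python
from urllib.parse import urlsplit

# Tiny illustrative stand-in for the Public Suffix List; deliberately incomplete.
MULTI_LABEL_SUFFIXES = {"co.uk", "com.au", "co.jp"}


def registered_domain(url: str) -> str:
    """Return the registrable part of the host, e.g. 'wikipedia.org' for 'en.wikipedia.org'."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    labels = host.split(".")
    if len(labels) <= 2:
        return host
    if ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES:
        return ".".join(labels[-3:])  # e.g. bbc.co.uk
    return ".".join(labels[-2:])


def site_key(url: str) -> tuple:
    """Comparison key that ignores the subdomain entirely."""
    parts = urlsplit(url)
    return (registered_domain(url), parts.path, parts.query)
```

With this key, the wikipedia.org and en.wikipedia.org URLs compare equal. The trade-off is that different subdomains serving different content at the same path would also compare equal.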

Aug 06 '22 01:08 jnmclarty