
Canonical tags ending with .html create a redirect-canonical loop and make the pages non-indexable

Open angelinekwan opened this issue 2 years ago • 0 comments

What's wrong?

On all pages of docs.aiven.io, the canonical tags point to redirected URL versions ending with .html. For example, https://docs.aiven.io/docs/tools/api has a canonical pointing to https://docs.aiven.io/docs/tools/api.html. This creates a redirect-canonical loop and makes the pages non-indexable.

To reproduce the issues:

Canonical tag in all pages

  1. Navigate to any docs page, for example https://docs.aiven.io/, and inspect the page source for the canonical tag.

  2. Notice that https://docs.aiven.io/index.html redirects to https://docs.aiven.io.

  3. Search engines don't like this and consider it a redirect loop. The canonical URL should be the final redirect URL, in this case <link rel="canonical" href="https://docs.aiven.io">

  4. The same applies to other pages, such as https://docs.aiven.io/docs/platform, which has the wrong canonical <link rel="canonical" href="https://docs.aiven.io/docs/platform.html">. The expected canonical is without .html: <link rel="canonical" href="https://docs.aiven.io/docs/platform">
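
The check described in these steps can be sketched in a few lines of Python. This is only an illustration, not part of the docs build: the helper names `extract_canonical` and `canonical_matches` are hypothetical, and the sketch assumes you already have the page's HTML and its final (post-redirect) URL in hand.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> in a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr_map = dict(attrs)
        rel_values = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attr_map.get("href")


def extract_canonical(html_text):
    """Return the canonical URL declared in html_text, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html_text)
    return finder.canonical


def canonical_matches(final_url, html_text):
    """True if the declared canonical equals the URL the page is served from.

    Trailing slashes are ignored, which is a simplification made for this sketch.
    """
    canonical = extract_canonical(html_text)
    if canonical is None:
        return False
    return canonical.rstrip("/") == final_url.rstrip("/")
```

With the current markup, `canonical_matches("https://docs.aiven.io/docs/platform", page_html)` would return False, because the declared canonical still ends with .html.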

Sitemap issue

  1. Open the sitemap https://docs.aiven.io/sitemap.xml and notice that every <loc> ends with .html, and that they all redirect to the path without .html.
<url>
<loc>https://docs.aiven.io/docs/platform.html</loc>
</url>
  2. Search engines don't like this and consider it a redirect loop. The sitemap URL should be the final redirect URL, in this case:
<url>
<loc>https://docs.aiven.io/docs/platform</loc>
</url>
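
As a stopgap, the generated sitemap could be rewritten after the build. The following is a minimal sketch of that idea, not something sphinx-sitemap provides: the function names are hypothetical, and it assumes the file uses the standard sitemaps.org namespace (it also tolerates files with no namespace).

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def drop_html_suffix(url):
    """Map .../index.html to .../ and .../page.html to .../page."""
    if url.endswith("/index.html"):
        return url[: -len("index.html")]
    if url.endswith(".html"):
        return url[: -len(".html")]
    return url


def strip_html_from_sitemap(xml_text):
    """Return the sitemap XML with every <loc> rewritten to its extension-less URL."""
    # Keep the default namespace unprefixed when serializing.
    ET.register_namespace("", SITEMAP_NS)
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # Tags look like "{namespace}loc" when the file declares a namespace.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and element.text:
            element.text = drop_html_suffix(element.text.strip())
    return ET.tostring(root, encoding="unicode")
```

Post-processing like this only hides the symptom; the cleaner fix is for the generator itself to emit the final URLs.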

Expected behaviour

  • The canonical in the example should be https://docs.aiven.io/docs/tools/api, without .html
  • The sitemap should list pages without .html

URL of affected page, and any other information

This affects all pages. The issue appeared after the migration from Netlify to Cloudflare Pages on 22.12.2022. Apparently, Cloudflare Pages automatically redirects HTML pages to their extension-less counterparts: for instance, /contact.html is redirected to /contact, and /about/index.html is redirected to /about/ (documentation).
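
The redirect rule quoted above can be modelled with a small function. This is an approximation based only on the two examples in this issue, not Cloudflare's actual implementation, and the name `pages_redirect_target` is hypothetical.

```python
def pages_redirect_target(path):
    """Approximate the path a request for `path` ends up at after the redirect."""
    if path.endswith("/index.html"):
        # /about/index.html -> /about/
        return path[: -len("index.html")]
    if path.endswith(".html"):
        # /contact.html -> /contact
        return path[: -len(".html")]
    # Extension-less paths are served as-is.
    return path
```

Any canonical or sitemap URL for which `pages_redirect_target(path) != path` points at a redirecting address, which is the situation described in this issue.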

Notes

  • The canonical URL is added automatically by Sphinx, based on html_baseurl.
  • The sitemap is generated by the Sphinx extension sphinx-sitemap, also based on html_baseurl.

Tested

Serve pages without the .html extension (reference): the -b dirhtml builder builds HTML pages with a single directory per document, which makes for prettier URLs (no .html) when served from a web server. However, the canonical URL still points to the .html version, which is a known bug.

Remove the default canonical URL: remove html_baseurl from conf.py and add a hardcoded canonical tag in _templates/base.html. This breaks the sitemap, which is then generated without a hostname.

angelinekwan • Jan 11 '23 13:01