devportal
Canonical tags ending with .html create a redirect-canonical loop and make pages non-indexable
What's wrong?
On every page of docs.aiven.io, the canonical tag points to a redirected URL ending in .html. For example, https://docs.aiven.io/docs/tools/api has a canonical of https://docs.aiven.io/docs/tools/api.html. This creates a redirect-canonical loop and makes the pages non-indexable.
To reproduce the issues:
Canonical tag in all pages
- Navigate to any docs page, for example https://docs.aiven.io/, and inspect the page source for the canonical tag.
- Notice that https://docs.aiven.io/index.html redirects to https://docs.aiven.io.
- Search engines do not like this and treat it as a redirect loop. The canonical URL should be the final redirect URL, in this case <link rel="canonical" href="https://docs.aiven.io">
- The same applies to other pages, such as https://docs.aiven.io/docs/platform, which has the wrong canonical <link rel="canonical" href="https://docs.aiven.io/docs/platform.html">. The expected canonical has no .html: <link rel="canonical" href="https://docs.aiven.io/docs/platform">
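The check above can also be scripted. The following is a minimal sketch using only the Python standard library; the helper names `find_canonical` and `canonical_matches` are hypothetical, and the sample HTML is hard-coded from the example above rather than fetched from the live site.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        is_canonical = (attr.get("rel") or "").lower() == "canonical"
        if tag == "link" and is_canonical and self.canonical is None:
            self.canonical = attr.get("href")


def find_canonical(html_text):
    """Return the canonical URL declared in the HTML, or None."""
    finder = CanonicalFinder()
    finder.feed(html_text)
    return finder.canonical


def canonical_matches(html_text, final_url):
    """True when the canonical equals the URL the page is finally served from."""
    canonical = find_canonical(html_text)
    if canonical is None:
        return False
    # Ignore a trailing slash so https://docs.aiven.io and https://docs.aiven.io/ compare equal.
    return canonical.rstrip("/") == final_url.rstrip("/")


# Hard-coded sample taken from the report; not fetched from the site.
page = '<head><link rel="canonical" href="https://docs.aiven.io/docs/platform.html"></head>'
print(find_canonical(page))
print(canonical_matches(page, "https://docs.aiven.io/docs/platform"))
```

For the sample page, the second call reports a mismatch, because the canonical still carries the .html suffix while the page is served without it.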
Sitemap issue
- Open the sitemap https://docs.aiven.io/sitemap.xml and notice that every <loc> ends in .html and redirects to the path without .html.
<url>
<loc>https://docs.aiven.io/docs/platform.html</loc>
</url>
- Search engines do not like this and treat it as a redirect loop. Each <loc> should be the final redirect URL, in this case
<url>
<loc>https://docs.aiven.io/docs/platform</loc>
</url>
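To list the affected sitemap entries, a short script along these lines may help. It is a sketch only: the function name `html_suffixed_locs` is hypothetical, the sample XML is hard-coded rather than downloaded, and it assumes the sitemap declares the standard sitemaps.org namespace, which the excerpt above does not show.

```python
import xml.etree.ElementTree as ET

# Assumption: the sitemap uses the standard sitemaps.org namespace.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def html_suffixed_locs(sitemap_xml):
    """Return every <loc> value in the sitemap that ends with .html."""
    root = ET.fromstring(sitemap_xml)
    locs = [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return [loc for loc in locs if loc.endswith(".html")]


# Hard-coded sample: one entry from the report plus one already-correct entry.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.aiven.io/docs/platform.html</loc></url>
  <url><loc>https://docs.aiven.io/docs/tools/api</loc></url>
</urlset>"""

for loc in html_suffixed_locs(sample):
    print("redirecting entry:", loc)
```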
Expected behaviour
- The canonical in the example should be https://docs.aiven.io/docs/tools/api, without .html
- The sitemap should list pages without .html
URL of affected page, and any other information
All pages are affected. The issue appeared after the migration from Netlify to Cloudflare Pages on 22.12.2022. According to the Cloudflare Pages documentation, Cloudflare Pages automatically redirects HTML pages to their extension-less counterparts: for instance, /contact.html is redirected to /contact, and /about/index.html is redirected to /about/.
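The redirect rule described above can be approximated in a few lines, which is useful for predicting the URL a canonical or sitemap entry should point to. This is a sketch of the behaviour as quoted in this report, not Cloudflare's implementation; `extensionless_path` is a hypothetical name.

```python
def extensionless_path(path):
    """Approximate the redirect target for a URL path, per the rule quoted above."""
    # /about/index.html -> /about/ (keep the trailing slash)
    if path.endswith("/index.html"):
        return path[: -len("index.html")]
    # /contact.html -> /contact
    if path.endswith(".html"):
        return path[: -len(".html")]
    # Paths without .html are served as-is.
    return path


for p in ("/contact.html", "/about/index.html", "/docs/platform.html", "/docs/platform"):
    print(p, "->", extensionless_path(p))
```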
Notes
- The canonical URL is added automatically by Sphinx based on html_baseurl.
- The sitemap is generated by the Sphinx extension sphinx-sitemap, also based on html_baseurl.
Tested
❌ Serve pages without the .html extension.
Reference: -b dirhtml
- Builds HTML pages, but with a single directory per document. Makes for prettier URLs (no .html) if served from a webserver. However, the canonical URL still points to the .html extension, which is a known bug.
❌ Remove the default canonical URL.
By removing html_baseurl from conf.py and adding a hardcoded canonical tag in _templates/base.html. This breaks the sitemap, which is then generated without a hostname.