Improve SEO and maintenance of documentation versions
In #2980 we discussed the fact that too many Kedro versions appear in search results.
We fixed that in #3030 by manually controlling which versions we wanted to be indexed.
This caused a number of issues though, most importantly #3710: we had accidentally excluded our subprojects from our search results.
We fixed that in #3729 in a somewhat unsatisfactory fashion. In particular, there are concerns about consistency and maintainability https://github.com/kedro-org/kedro/pull/3729#discussion_r1539480799 (see also https://github.com/kedro-org/kedro/issues/2600#issuecomment-1772673087 about the problem of projects under kedro-org/kedro-plugins not having a `stable` version).
In addition, my mind has evolved a bit and I think we should only index 1 version in search engines: `stable`. There were concerns about users not understanding the flyout menu https://github.com/kedro-org/kedro/issues/2980#issuecomment-1705277829, and honestly the `latest` part is also quite confusing (#2823, https://github.com/readthedocs/readthedocs.org/issues/10674), but that's a whole separate discussion.
For now, the problems we want to solve are:
- #2980 again (not reopening it, hence this issue), by allowing only 1 version, the most recent stable one, and
- the ongoing maintenance of `robots.txt`, ideally by not having to ever touch it again.
And on the topic of the flyout, here's my thinking:
I think I have become numb to the whole `stable`/`latest` thing from RTD, but I think @stichbury is right that this is not at all obvious.
Now, look at what happens when I change the default version to be `0.19.3` instead of the current `stable`:
- User types https://docs.kedro.org
- User gets redirected to https://docs.kedro.org/en/0.19.3/
- The flyout shows the version number
By having the number in the URL and also in the flyout by default, I think it's more obvious how the user should go and switch to their version of choice.
@stichbury in your opinion, do you think this would make our docs journey more palatable?
I think this is good, but doesn't it mean that you have to remember to increment the version number for `stable` in the control panel each time you make a release? If you don't, it makes it hard to find the docs for that release (which incidentally are the latest stable 🤦 docs).
> doesn't it mean that you have to remember to increment the version number for stable in the control panel each time you make a release?
It does... but sadly RTD doesn't allow much customization of the versioning rules for now. It's a small price to pay though; it would happen only a handful of times per year.
TIL: `robots.txt` and pages actually indexed by Google are completely orthogonal https://github.com/kedro-org/kedro/issues/3708#issuecomment-2021054440
To note, RTD has automation rules https://docs.readthedocs.io/en/stable/automation-rules.html#actions-for-versions, although the `stable`/`latest` rules are unfortunately implicit https://github.com/readthedocs/readthedocs.org/issues/5319
I think the `/stable/:splat -> /page/:splat` redirection trick we got recommended in https://github.com/readthedocs/readthedocs.org/issues/11183#issuecomment-2032403457 can also solve the long-standing problem of not having `stable` versions for repos in `kedro-plugins` https://github.com/kedro-org/kedro/issues/2600#issuecomment-1772673087
Here's the 📣 proposal
- We turn `/stable` into a redirection to, well, the most recent stable version, in all subprojects (framework, viz, datasets); a sketch of how such a redirect could be configured follows below.
- All links to `/stable` will keep working, but instead of staying in `/stable`, they will get automatically redirected to the corresponding version, for example `/0.19.4` or `/projects/kedro-datasets/3.0.0`.
- `/latest` will continue being `/latest` because it's not possible to rename it https://github.com/readthedocs/readthedocs.org/issues/10674 (but it will continue having a "This is the latest development version" banner that we can tweak with CSS in the future).
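For illustration, here is a rough sketch of how such a redirect could be created programmatically. The Read the Docs API v3 redirects endpoint, its field names, and the wildcard/`:splat` syntax are assumptions based on the upstream discussion linked above; the token, project slug and target version are placeholders that would need updating on every release.

```python
# Hypothetical sketch: create an "exact" redirect /en/stable/* -> /en/0.19.4/:splat
# via the Read the Docs API v3. Endpoint, field names and payload are assumptions;
# the token, project slug and target version are placeholders.
import requests

RTD_TOKEN = "<api-token>"  # placeholder, never commit a real token
PROJECT_SLUG = "kedro"     # assumed project slug

response = requests.post(
    f"https://readthedocs.org/api/v3/projects/{PROJECT_SLUG}/redirects/",
    headers={"Authorization": f"Token {RTD_TOKEN}"},
    json={
        "from_url": "/en/stable/*",     # every page under /stable...
        "to_url": "/en/0.19.4/:splat",  # ...goes to the current stable version
        "type": "exact",
    },
    timeout=30,
)
response.raise_for_status()
```

The same call would have to be repeated (or updated) for each subproject and on every release, which is part of the maintenance cost to weigh.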
The only thing we need to understand is what the impact on indexing and SEO would be. cc @noklam @ankatiyar
Thoughts @stichbury ?
I've somewhat lost track of what your `robots.txt` changes have been, but as I understand it, you want to index just 1 version, and this would be `stable`, and this would be what is shown in search results (but in fact, if the user navigates to `stable` they're redirected to a numbered version). Is this workable -- does the Google crawler cope with redirects?
I would personally consider if it's sufficient to just keep `stable` as the indexed version and avoid the redirecting shenanigans. It is introducing complexity which makes maintenance harder. I understand the reasoning (I think; you can brief me in our next call) but is this helping users? (I think most users can cope with the concept of "stable" after all, and some may actively seek it out.) Let's discuss on Monday, but if you need/want to go ahead in the meantime, please do, under some vague level of advisement!
In principle this is related to our indexing strategy, `robots.txt`, etc., but it goes beyond that: it's more about keeping `/stable` as something our users get used to, or moving away from that to establish consistency across the subprojects.
Let's chat next week
Renamed this issue to better reflect what we should do here.
In https://github.com/readthedocs/readthedocs.org/issues/10648#issuecomment-2021128135, RTD staff gave an option to inject `meta noindex` tags on the docs depending on the versioning. That technique is very similar to the one described in https://www.stevenhicks.me/blog/2023/11/how-to-deindex-your-docs-from-google/ (discovered by @noklam).
It's clear that we have to shift our strategy by:
- avoiding mangling `robots.txt` going forward,
- improving how we craft our sitemaps, and
- adding some templating tricks to our docs so that proper `meta noindex` and `link rel=canonical` HTML tags are generated (see the sketch below).
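To make the last bullet concrete, here is a minimal sketch (not the final theme implementation) of how a Sphinx `conf.py` could emit those tags: `html_baseurl` makes Sphinx add `rel=canonical` links, and a small `html-page-context` hook adds `noindex` to non-stable builds. The reliance on the `READTHEDOCS_VERSION` environment variable and the hard-coded canonical URL are assumptions to validate.

```python
# conf.py sketch -- assumes builds run on Read the Docs, which exposes the
# version slug via the READTHEDOCS_VERSION environment variable.
import os

# With html_baseurl set, Sphinx emits <link rel="canonical" href="..."> on
# every page; pointing it at /en/stable/ is the assumption here.
html_baseurl = "https://docs.kedro.org/en/stable/"

RTD_VERSION = os.environ.get("READTHEDOCS_VERSION", "")


def add_noindex(app, pagename, templatename, context, doctree):
    """Ask search engines not to index anything that is not the stable version."""
    if RTD_VERSION and RTD_VERSION != "stable":
        context["metatags"] = context.get("metatags", "") + (
            '\n<meta name="robots" content="noindex">'
        )


def setup(app):
    app.connect("html-page-context", add_noindex)
```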
Today I had to manually index https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.0 on Google (maybe there are no inbound links?) and I couldn't index 3.0.1 (it's currently blocked by our `robots.txt`).
Summary of things to do here:
- [ ] Stop manually crafting our `robots.txt`, use the default one generated by Read the Docs (docs)
- [ ] Add some logic to our `kedro-sphinx-theme` so that `rel=canonical` links pointing to `/stable` are inserted in older versions, as suggested in https://github.com/readthedocs/readthedocs.org/issues/10648#issuecomment-2021128135
- [ ] Consider making those changes retroactive for a few versions, and if too much work or not feasible, propose alternatives
- [ ] Pause and evaluate results of efforts so far
- [ ] Consider crafting a `sitemap.xml` manually (docs); a sketch follows after the references
Refs: https://www.stevenhicks.me/blog/2023/11/how-to-deindex-your-docs-from-google/, https://developers.google.com/search/docs/crawling-indexing/consolidate-duplicate-urls
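As a starting point for the last checklist item, this is a rough sketch of what a hand-crafted, minimal sitemap could look like with a single indexable entry per (sub)project. The URL list and the use of the generation time as `<lastmod>` are assumptions, not a settled design.

```python
# generate_sitemap.py -- sketch of a minimal, hand-rolled sitemap.xml.
# The URLs and the <lastmod> strategy below are illustrative assumptions.
from datetime import datetime, timezone
from xml.etree import ElementTree as ET

URLS = [
    "https://docs.kedro.org/en/stable/",
    "https://docs.kedro.org/projects/kedro-viz/en/stable/",
    "https://docs.kedro.org/projects/kedro-datasets/en/latest/",
]

urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for loc in URLS:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = loc
    # Using "now" as a stand-in; ideally this would be the release date.
    ET.SubElement(url, "lastmod").text = datetime.now(timezone.utc).isoformat()
    ET.SubElement(url, "priority").text = "1.0"

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```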
Today I've been researching this again (yeah, I have weird hobbies...)
I noticed that projects hosted on https://docs.rs don't seem to exhibit these SEO problems, and also that they seemingly take a basic, but effective, approach.
Compare https://docs.rs/clap/latest/clap/ with https://docs.rs/clap/2.34.0/clap/. There is no trace of `<meta>` `noindex`/`nofollow` tags.
What they do, though, is have very lean sitemaps. If you look at https://docs.rs/-/sitemap/c/sitemap.xml, there are only 2 entries for `clap`:
<url>
<loc>https://docs.rs/clap/latest/clap/</loc>
<lastmod>2024-08-10T00:24:50.344647+00:00</lastmod>
<priority>1.0</priority>
</url>
<url>
<loc>https://docs.rs/clap/latest/clap/all.html</loc>
<lastmod>2024-08-10T00:24:50.344647+00:00</lastmod>
<priority>0.8</priority>
</url>
Compare it with https://docs.kedro.org/sitemap.xml, which is... less than ideal:
<url>
<loc>https://docs.kedro.org/en/stable/</loc>
<lastmod>2024-08-01T18:53:11.571849+00:00</lastmod>
<changefreq>weekly</changefreq>
<priority>1</priority>
</url>
<url>
<loc>https://docs.kedro.org/en/latest/</loc>
<lastmod>2024-08-09T09:39:27.628501+00:00</lastmod>
<changefreq>daily</changefreq>
<priority>0.9</priority>
</url>
<url>
<loc>https://docs.kedro.org/en/0.19.7/</loc>
<lastmod>2024-08-01T18:53:11.647322+00:00</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://docs.kedro.org/en/0.19.6/</loc>
<lastmod>2024-05-27T16:32:42.584307+00:00</lastmod>
<changefreq>monthly</changefreq>
<priority>0.7</priority>
</url>
<url>
<loc>https://docs.kedro.org/en/0.19.5/</loc>
<lastmod>2024-04-22T11:56:55.928132+00:00</lastmod>
<changefreq>monthly</changefreq>
<priority>0.6</priority>
</url>
<url>
<loc>https://docs.kedro.org/en/0.19.4.post1/</loc>
<lastmod>2024-05-17T12:25:27.050615+00:00</lastmod>
<changefreq>monthly</changefreq>
<priority>0.5</priority>
</url>
...
The way I read this is that RTD is treating tags as long-lived branches, and as a result telling search engines that docs of old versions will be updated monthly, which in our current scheme is incorrect.
I am not sure if this is something worth reporting to RTD, but maybe we should look at uploading a custom `sitemap.xml` before doing the whole retroactive `meta` tag story.
Reopening until we solve the issue (whether improving the sitemaps, retroactively changing the tags, or painting a pentagon with a turkey's head...)
Added @DimedS to our Google Search Console, hope this will help!
Thank you for bringing this new idea and providing me access to GSC, @astrojuanlu!
What I understand after investigation:
- After we switched to an auto-generated `robots.txt` file following the last release, we received the following version of it:
User-agent: *
Disallow: /en/develop/ # Hidden version
Disallow: /en/0.19.4.post1/ # Hidden version
Disallow: /en/0.19.3.post1/ # Hidden version
Disallow: /en/0.19.2.post1/ # Hidden version
Disallow: /en/0.19.1.post1/ # Hidden version
Disallow: /en/0.19.0.post1/ # Hidden version
Disallow: /en/0.17.7/ # Hidden version
Disallow: /en/0.17.6/ # Hidden version
Disallow: /en/0.17.5/ # Hidden version
Disallow: /en/0.17.4/ # Hidden version
Disallow: /en/0.17.3/ # Hidden version
Disallow: /en/0.17.2/ # Hidden version
Disallow: /en/0.17.1/ # Hidden version
Disallow: /en/0.17.0/ # Hidden version
Disallow: /en/0.16.6/ # Hidden version
Disallow: /en/0.16.5/ # Hidden version
Disallow: /en/0.16.4/ # Hidden version
Disallow: /en/0.16.3/ # Hidden version
Disallow: /en/0.16.2/ # Hidden version
Disallow: /en/0.16.1/ # Hidden version
Disallow: /en/0.16.0/ # Hidden version
Disallow: /en/0.15.9/ # Hidden version
Disallow: /en/0.15.8/ # Hidden version
Disallow: /en/0.15.7/ # Hidden version
Disallow: /en/0.15.6/ # Hidden version
Disallow: /en/0.15.5/ # Hidden version
Disallow: /en/0.15.4/ # Hidden version
Disallow: /en/0.15.3/ # Hidden version
Disallow: /en/0.15.2/ # Hidden version
Disallow: /en/0.15.0/ # Hidden version
Disallow: /en/0.14.3/ # Hidden version
Sitemap: https://docs.kedro.org/sitemap.xml
This file is large and somewhat inconvenient. More importantly, it does not disallow many older versions, such as everything related to 0.18.0-0.18.12, as well as the `/projects` folder containing all versions of Viz and Datasets.
- As a consequence, the number of indexed pages significantly increased after the release. This has led to many older versions of the documentation being indexed, which we do not want.
- As I understand it, we cannot fix this situation with the `sitemap.xml` alone, as that file plays only a supporting role in the indexing process. The primary control lies with `robots.txt`. You can observe this effect now because the `/projects` folder containing Viz and Datasets is not mentioned in our `sitemap.xml`, yet many versions have been indexed anyway. They will continue to be indexed until we disallow them in `robots.txt`.
- It seems challenging to achieve an auto-generated `robots.txt` that fits our needs well. There are alternative solutions for fixing the versioning issue, such as canonical tags, but these solutions may not be perfectly robust.
- I agree with @astrojuanlu's opinion that we should index only one version of our docs—the stable version. As a simple and robust solution, I propose reverting to a manually created `robots.txt` file with the following stable content:
User-agent: *
Disallow: /
Allow: /en/stable/
Allow: /projects/kedro-viz/en/stable/
Allow: /projects/kedro-datasets/en/latest/
This configuration means we will disallow indexing anything except the stable version of Kedro and Viz docs, and the latest version of Datasets docs (since we do not have a stable version of them). If for some reason we are unhappy with the latest Datasets approach, we can start with this and create an additional ticket to explore alternative solutions for Datasets docs versioning, such as canonical tags.
Thanks for the investigation @DimedS. One very important thing to have in mind is this:
> If you do want to block this page from Google Search, robots.txt is not the correct mechanism to avoid being indexed. To avoid being indexed, remove the robots.txt block and use 'noindex'.
https://support.google.com/webmasters/answer/7440203#indexed_though_blocked_by_robots_txt
> Warning: Don't use a robots.txt file as a means to hide your web pages (including PDFs and other text-based formats supported by Google) from Google search results.
https://developers.google.com/search/docs/crawling-indexing/robots/intro#what-is-a-robots.txt-file-used-for
> If the page is blocked by a robots.txt file or the crawler can't access the page, the crawler will never see the `noindex` rule, and the page can still appear in search results, for example if other pages link to it.
https://developers.google.com/search/docs/crawling-indexing/block-indexing
Google is very clear: blocking a page in `robots.txt` actually goes against the objective of de-indexing it.
With this in mind, addressing your comments:
1. I don't follow what problem you see in the automatically generated `robots.txt`.
2. I think the number of indexed pages going up is actually a good thing. What we need to do is to tell the search engine which ones to de-index and which ones to prioritise.
3. "The primary control lies with robots.txt." That's not my understanding of the situation looking at the official Google documentation (see links above). And also, past experience indicates that even before removing our manually crafted `robots.txt` we had SEO problems too.

   3b. "the /projects folder containing Viz and Datasets is not mentioned in our sitemap.xml": that seems like a bug from RTD, maybe worth reporting upstream. But that's not nearly the most important problem of RTD's `sitemap.xml`: it's the weird tendency of including every single version, also older ones (see my earlier analysis).
4. RTD's generated `robots.txt` is consistent with the versions that we're hiding in the flyout menu. That decision in itself is debatable, looking at the official Google documentation (links above). If anything, that `robots.txt` benefits RTD because they get fewer visits from Googlebot:
> You can use a robots.txt file for web pages [...] to manage crawling traffic if you think your server will be overwhelmed by requests from Google's crawler, or to avoid crawling unimportant or similar pages on your site.
5. On this we agree, but blocking older versions will not de-index them, just fill our Search Console with warnings.
Can we instead try generating a `sitemap.xml` that contains this:
<url>
<loc>https://docs.kedro.org/en/stable/</loc>
<lastmod>2024-08-01T18:53:11.571849+00:00</lastmod>
<changefreq>monthly</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>https://docs.kedro.org/en/latest/</loc>
<lastmod>2024-08-09T09:39:27.628501+00:00</lastmod>
<changefreq>daily</changefreq>
<priority>0.5</priority>
</url>
and nothing else?
And notice that generating the sitemap is not even guaranteed to work, but the next step is quite complicated: retrofitting `noindex` tags to all of the older versions and letting Google crawl them so that Googlebot can see the `noindex` properly.
And also, I'm guessing that generating the `<lastmod>` dates on the `sitemap.xml` won't be easy (in the case of `/stable`) or even possible (in the case of `/latest`).
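One possible (and entirely speculative) workaround for the `/stable` case would be to use the publication date of the latest GitHub release as `<lastmod>`; the sketch below assumes that proxy is acceptable, it is not something RTD provides.

```python
# Sketch: derive a <lastmod> value for /stable from the latest GitHub release.
# Assumption: the release date is a good enough proxy for "last modified".
import requests

resp = requests.get(
    "https://api.github.com/repos/kedro-org/kedro/releases/latest",
    headers={"Accept": "application/vnd.github+json"},
    timeout=30,
)
resp.raise_for_status()
lastmod = resp.json()["published_at"]  # ISO 8601, e.g. "2024-08-01T18:53:11Z"
print(f"<lastmod>{lastmod}</lastmod>")
```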
Opened https://github.com/readthedocs/readthedocs.org/issues/11584 upstream.
Thank you for your valuable insights about `robots.txt`, @astrojuanlu.
I agree that `robots.txt` is not a silver bullet. What I want to emphasize is that between `robots.txt` and `sitemap.xml`, the primary control lies with `robots.txt`, while `sitemap.xml` serves merely as a weak recommendation for crawlers. `robots.txt` generally works well, but if there's a third-party link to a particular page, the crawler might still index it, even if it's not allowed in `robots.txt`.
Here are my thoughts:
- First and foremost, we should agree that we only want stable versions (or the latest versions in the case of datasets) of our docs to appear in search results.
- To achieve this, there are two options:
  - Not being indexed at all.
  - Being indexed but using special tags on every page (`<meta name="robots" content="noindex">`).
- To avoid being indexed:
  - Using `robots.txt` is the easiest method, and it works in most cases. Over the past few years, we've used it successfully, resulting in around 23,000 pages not being indexed, with only about 4,000 pages indexed.
  - In my understanding, `sitemap.xml` does not help prevent pages from being indexed. As demonstrated in our case, the `projects/` folder was removed from `sitemap.xml` but continues to be indexed and shown in search results. This suggests that excluding a page from the sitemap does not prevent it from being indexed. I also believe that `sitemap.xml` does not significantly influence search result prioritisation.
- Based on what I've observed with the indexed pages, after we updated `robots.txt` and allowed the 0.18.x versions, many of them started being indexed, which likely accounts for most of the recent increase in indexed pages. However, despite disallowing 0.17.7, it is still being indexed, indicating that `robots.txt` doesn't always work perfectly, as you explained, though such cases are rare. We could consider modifying the rules for specific versions, but overall, `robots.txt` has been quite effective.
Therefore, I support your proposal to give `sitemap.xml` a chance, even though I'm skeptical that it will help. We could try using `sitemap.xml` alone or in conjunction with the custom `robots.txt` that I previously suggested. However, as the next step after it, I believe we should continue with a simplified `robots.txt` and then address the specific cases where it hasn't worked.
I fully agree 👍🏼 Let's at least try to apply the scientific method, and change 1 bit at a time. If a custom `sitemap.xml` doesn't work in the next version, let's try the `robots.txt` approach you suggest.
xref in case it's useful https://github.com/jdillard/sphinx-sitemap
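If we went the sphinx-sitemap route, the configuration would presumably look something like the sketch below; the option names are taken from the extension's documentation, and pinning `sitemap_url_scheme` to the stable path is just one idea, not a tested setup.

```python
# conf.py -- hypothetical sphinx-sitemap setup; double-check option names
# against https://github.com/jdillard/sphinx-sitemap before relying on this.
extensions = [
    # ...existing extensions...
    "sphinx_sitemap",
]

# sphinx-sitemap builds each URL from html_baseurl + sitemap_url_scheme.
html_baseurl = "https://docs.kedro.org/"
# Emit only stable URLs instead of the default "{lang}{version}{link}" scheme
# (assumption: we want exactly one indexable entry per page).
sitemap_url_scheme = "en/stable/{link}"
```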
Would it make sense to manually create the sitemap first and see if it works as expected? If successful, we could then consider incorporating an automated generation process in the next step, if needed.
For reference, I tried the redirection trick described in https://github.com/kedro-org/kedro/issues/3741#issuecomment-2051247642 for kedro-datasets https://github.com/kedro-org/kedro/pull/4145#discussion_r1746621971 and it seems to be working.
I don't want to boil the ocean right now because we're in the middle of some delicate SEO experimentation phase, but when the dust settles, I will propose this for all our projects.