
Generate a sitemap index / Allow custom sitemap.xml

Open humitos opened this issue 6 years ago • 10 comments

We are already generating a sitemap.xml for all projects by default. However, we don't take into account any sitemap.xml generated by Sphinx at all.

This issue is the continuation of #557 and this specific comment about creating a global sitemap index at root pointing to the ones that are in subpaths.

Related: #6903

humitos avatar Mar 04 '19 10:03 humitos

Hi @humitos does this feature still need amendments or ready to implement? If yes can you provide with some insights to accomplish this. Thank You.

aditya-prayaga avatar Mar 06 '19 18:03 aditya-prayaga

@aditya-369 the issue is under "Design decision" (https://docs.readthedocs.io/en/latest/contribute.html#initial-triage) and still needs some discussion.

However, if you follow the links from the description you will find some extra context and proposals about how to implement it. If you want, you can read through them and make a more specific proposal on how this could be implemented, and then we can discuss which approach would be better and easier. Thanks for the interest!

This is definitely a feature that we want to have.

humitos avatar Mar 11 '19 15:03 humitos

I noticed sitemap.xml started being generated recently, thanks for this extremely useful development! I have just read through the previous discussion and PR. A couple of points (can open separate issues if you prefer):

  • The generated sitemap currently gets the URL by calling get_docs_url, which by definition returns an http address. But this is usually a 301 redirect to an https URL rather than an actual page, which may result in a penalty with some search engines. Sitemaps should only point to actual pages (references: Google, Bing) or the crawler may begin losing trust in the sitemap.
  • The hreflang for regional variations in the sitemap follows the format of the URL language slug generated by Sphinx, e.g. zh_CN for Chinese (China). This is invalid syntax for hreflang in a sitemap (reference), where a hyphen must be used instead, e.g. zh-CN. Alternatively, it is also valid to define the script of the language instead, e.g. zh-Hans for Simplified Chinese. (See the example entry after this list.)
  • The sort order should prioritise the user-selected default version (Admin > Versions > Default Version) in the backend instead of setting latest to highest priority. In many cases (and also by definition), latest points at development documentation while stable is the version most people should be using, and should appear first in search results.
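
To illustrate the first two points (the project slug and URLs here are hypothetical), a sitemap entry following those references would use an https URL in <loc> and the hyphenated hreflang form, even though the URL path keeps Sphinx's zh_CN slug:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <!-- "myproject" is a placeholder project slug -->
  <url>
    <loc>https://myproject.readthedocs.io/zh_CN/stable/</loc>
    <xhtml:link rel="alternate" hreflang="zh-CN"
                href="https://myproject.readthedocs.io/zh_CN/stable/"/>
    <xhtml:link rel="alternate" hreflang="en"
                href="https://myproject.readthedocs.io/en/stable/"/>
  </url>
</urlset>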

I am looking forward to further development of this feature and want to contribute if possible. I'm not much of a coder, but I'm willing to learn or to help with testing on my fairly large and complex documentation. My preferred implementation would be to add an option in conf.py to generate a user-controlled sitemap at https://$url/$lang/$version/sitemap.xml, group these together in an automatically generated sitemap index at https://$url/sitemap_index.xml, and then specify this file in robots.txt.
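
As a rough sketch of that proposal (docs.example.com and the language/version paths are placeholders), the root sitemap_index.xml would follow the sitemaps.org index format and point at the per-version sitemaps, with robots.txt containing a single Sitemap: https://docs.example.com/sitemap_index.xml line:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- one entry per language/version sitemap -->
  <sitemap>
    <loc>https://docs.example.com/en/stable/sitemap.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://docs.example.com/en/latest/sitemap.xml</loc>
  </sitemap>
</sitemapindex>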

strophy avatar Mar 12 '19 08:03 strophy

@strophy I appreciate your feedback here.

The generated sitemap currently gets the URL by calling get_docs_url, which by definition returns an http address

I think the docstring of that method is wrong. For now, it only returns HTTP when it's a custom domain, because we can't guarantee that it has SSL set up (see #4641).

# This is from the current production server

# Project served on the default readthedocs.io subdomain: always HTTPS
In [1]: docs = Project.objects.get(slug='docs')

In [2]: docs.get_docs_url()
Out[2]: 'https://docs.readthedocs.io/en/stable/'

# Project served on a custom domain, where SSL isn't guaranteed: HTTP
In [3]: pip = Project.objects.get(slug='pip')

In [4]: pip.get_docs_url()
Out[4]: 'http://pip.pypa.io/en/stable/'

A couple of points (can open separate issues if you prefer):

Yes, please. This issue is about generating a sitemap index, and your suggestions/reports are about bugs in the current implementation. I'd appreciate it if you create one issue per problem. Thanks!

humitos avatar Mar 12 '19 11:03 humitos

@humitos, your current sitemap.xml can't be configured from the project side. This may be handy if you want to disallow indexing of some versions (Google Search Console considers it an error when you submit URLs that are blocked by robots.txt).

skirpichev avatar Sep 09 '19 19:09 skirpichev

@skirpichev I'm not really sure I follow your issue. Can you expand and give an example of what you are trying to do?

humitos avatar Sep 09 '19 20:09 humitos

@humitos, I'm not sure it's a real issue, maybe a minor one. But let's suppose you want to block crawlers from certain versions of the readthedocs docs. Your docs suggest doing this with robots.txt. But the project's sitemap.xml will still list these "disallowed" versions. Google Search Console considers this a misconfiguration.
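
For example (the version path is hypothetical), a custom robots.txt following that suggestion might be:

User-agent: *
Disallow: /en/0.9/

while the auto-generated sitemap.xml still lists URLs under /en/0.9/, which is the conflict Search Console flags.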

skirpichev avatar Sep 10 '19 05:09 skirpichev

If you disable Versions from your Project, they are not going to be shown in the sitemap.xml.

This issue is about other, more complex cases. Examples:

  • being able to define your own sitemap.xml instead of RTD generating one automatically for you
  • creating a sitemap index at the root that points to other sitemap.xml from other directories (see https://www.sitemaps.org/protocol.html#index)

humitos avatar Sep 10 '19 09:09 humitos

If you disable Versions from your Project, they are not going to be shown in the sitemap.xml.

This is true if you disable a version (make it inactive); it is not true if you hide a version. The result is that crawlers get confused, as the hidden version gets added to Disallow in robots.txt but still remains in sitemap.xml. It's unclear to me if there is a purpose for this; it seems to me hidden versions should be removed from the generated sitemap.xml if they're also going to be disallowed, the same as disabled versions. Regardless, I documented my workaround here.

This specific example can be seen in pyngrok's documentation. 4.1.9, for example, is active but hidden, so it does not show up in the menu anymore (but we want permalinks to continue working), yet it does still show up in the auto-generated sitemap.

alexdlaird avatar Aug 25 '20 15:08 alexdlaird

I just got a support request from a user saying that the sitemap.xml generated by Read the Docs does not work as they expected.

being able to define your own sitemap.xml instead of RTD generating one automatically for you

I think this should be the way to go, in a similar way to what we do with robots.txt. That way, users could generate their sitemap.xml exactly the way they want if the one created by Read the Docs is not enough for them.

humitos avatar Aug 24 '22 09:08 humitos

It seems there is no need to build a feature to allow users to define a custom sitemap.xml, since they can just point to their own by using a custom robots.txt. Example:

User-agent: *
Allow: /

Sitemap: https://docs.example.com/en/stable/sitemap.xml

Read more about this at https://docs.readthedocs.io/en/stable/reference/sitemaps.html#custom-sitemap-xml
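
For Sphinx projects, one way to ship a custom robots.txt and sitemap.xml with the built HTML (a sketch; the extra/ directory name is just an example) is Sphinx's html_extra_path option in conf.py:

# conf.py
# Copy everything in this directory (for example a hand-written robots.txt
# and sitemap.xml) into the root of the built HTML output.
html_extra_path = ["extra"]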

I'm closing this issue since we already have documented how to achieve this goal. If you consider there are still missing pieces here, please open new issues.

humitos avatar Apr 09 '24 17:04 humitos