readthedocs.org
Structured Metadata for Search & SEO
We could improve the SEO of Read the Docs by using structured metadata. Here's Google's documentation on the subject. Basically, this involves adding special tags (or JSON) to parts of our site that give search engines a deeper understanding of its content.
For example, we could add the following to the output of the documentation for the Read the Docs Sphinx theme or to its project page:
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "SoftwareApplication",
  "name": "Read the Docs Sphinx Theme",
  "description": "The sphinx_rtd_theme is a sphinx theme designed to look modern and be mobile-friendly.",
  "keywords": "sphinx, python, readthedocs",
  "softwareVersion": "0.4.2",
  "softwareHelp": "https://sphinx-rtd-theme.readthedocs.io/en/latest/",
  "operatingSystem": "Windows, Mac, Linux",
  "applicationCategory": "DeveloperApplication",
  "inLanguage": "en",
  "license": "https://opensource.org/licenses/MIT",
  "datePublished": "2018-12-31",
  "url": "https://github.com/rtfd/sphinx_rtd_theme"
}
</script>
See the schema.org docs for "Software Application" for all possible attributes.
GitHub itself uses these tags. For example, if you view source on the readthedocs.org repository page, you'll notice references to schema.org. These are structured metadata.
You can test this metadata in Google's tooling.
Google also has docs specifically for marking up software apps.
A while ago I was able to extract some similar information from projects; we could reuse the same code for this: https://github.com/rtfd/readthedocs.org/issues/1758#issuecomment-439250406. Perhaps all of this fits better in the Sphinx extension?
This would be a great addition. I think we'd have to output context data from RTD, and pick that up in our Sphinx extension. However, we might already have all the metadata and context data we need available in Sphinx. I think the bulk of the work will be in the Sphinx extension, injecting this into the HTML output, regardless of theme.
I came across this issue again today. A user had a question on how to accomplish setting the canonical version for SEO purposes, which is a great question. I also realized this applies to translations as well.
Google's guidance on translations is here: https://developers.google.com/search/docs/specialty/international/localized-versions
This does feel like it should be a core RTD feature, given our focus is enabling multiple versions and translations. Perhaps given recent conversations, this should be implemented outside Sphinx though.
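For the translation part, Google's guidance boils down to every language version of a page listing itself and all of its siblings with `<link rel="alternate" hreflang="...">` tags. A minimal sketch of generating those tags, assuming we can map each language code to the base URL of that translation (the function name and the URLs are hypothetical):

```python
from html import escape


def alternate_link_tags(page_path, translations, default_language="en"):
    """Build <link rel="alternate"> tags for every translation of one page.

    translations maps a language code to the base URL of that translation,
    for example {"en": "https://docs.example.com/en/latest/"}.
    """

    def page_url(base_url):
        return base_url.rstrip("/") + "/" + page_path.lstrip("/")

    tags = []
    for language, base_url in sorted(translations.items()):
        href = escape(page_url(base_url), quote=True)
        lang = escape(language, quote=True)
        tags.append(f'<link rel="alternate" hreflang="{lang}" href="{href}" />')

    # x-default marks the fallback URL when no listed language matches.
    if default_language in translations:
        href = escape(page_url(translations[default_language]), quote=True)
        tags.append(f'<link rel="alternate" hreflang="x-default" href="{href}" />')
    return tags


if __name__ == "__main__":
    for tag in alternate_link_tags(
        "install.html",
        {
            "en": "https://docs.example.com/en/latest/",
            "es": "https://docs.example.com/es/latest/",
        },
    ):
        print(tag)
```

The same set of tags would go on every language version of the page, which is why the function takes the full translation map rather than only the current language.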
I gave a quick stab at this for our own docs, but it wasn't actually clear how to relate multiple versions of the same page together. As far as I can tell, this is not part of the SoftwareApplication schema type. There is a way to define translation relationships, but not for versions.
That's at least what I gather from https://github.com/schemaorg/schemaorg/issues/1476
From https://github.com/schemaorg/schemaorg/issues/975#issuecomment-671190715, it seems isPartOf could be used?
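To make the `isPartOf` idea concrete, here is one possible shape, sketched in Python: each versioned page is described as a `TechArticle` carrying a `version` value and pointing at a shared project node through `isPartOf`. This is only a guess at a workable encoding; nothing in the linked issues confirms that search engines interpret it as a version relationship, and the function name and example URLs are hypothetical.

```python
import json


def versioned_page_jsonld(project_name, project_url, version, page_url, page_title):
    """Describe one page of one documentation version as JSON-LD."""
    return {
        "@context": "https://schema.org",
        "@type": "TechArticle",
        "name": page_title,
        "url": page_url,
        # "version" is a generic CreativeWork property.
        "version": version,
        # Every version of the page points at the same project node.
        "isPartOf": {
            "@type": "CreativeWork",
            "@id": project_url,
            "name": project_name,
        },
    }


if __name__ == "__main__":
    for version in ("stable", "latest"):
        page = versioned_page_jsonld(
            project_name="Example Project",
            project_url="https://example-project.example.com/",
            version=version,
            page_url=f"https://example-project.example.com/en/{version}/install.html",
            page_title="Installation",
        )
        print(json.dumps(page, indent=2))
```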
I do see different tasks here:
- JSON metadata on web application (.org/.com): this can be done by statically adding this data in the base.html Django template
- JSON metadata on documentation pages (.io): this could be done in a Sphinx extension that users can decide whether or not to install (similar to what we did with sphinx-notfound-page)
- Canonical version: looks like an application feature similar to the canonical URL but including the "canonical version" on it as well instead of pointing to the root of the domain
this could be done in a Sphinx extension that users can decide whether or not to install
This feels like more of a core feature, not something that should be optional or only supported in Sphinx. With the work we're describing around generalizing all of the Sphinx extensions we've authored, I'm not sure I'd start with a Sphinx extension for new feature tests when we have the option of making it an agnostic post-processing step instead.
JSON metadata on web application (.org/.com)
I wasn't considering this, what exactly is the use case you see here?
Canonical version
I think we're describing addressing documentation versioning SEO with schema metadata, not a separate feature. Google, in theory, uses this metadata for SEO purposes, though they don't say specifically what they do with multiple versions of the same documentation.
@agjohnson
This feels like more of a core feature, not something that should be optional or only supported in Sphinx. With the work we're describing around generalizing all of the Sphinx extensions we've authored, I'm not sure I'd start with a Sphinx extension for new feature tests when we have the option of making it an agnostic post-processing step instead.
We don't know all the information required to construct the JSON that David described. How are you considering gathering all this information?
In that example, not all of the attributes are required. What I'm mostly interested in is building up the graph of documentation projects/versions/translations linking to each other. Right now, versions and translations might be considered duplicate content to Google, and this could be negatively affecting SEO for projects.
The (big?) hang-up is that the current schema does not offer an explicit way to define the version relationships between pages/projects. This is where the isPartOf attribute might be needed. The schema does have a mechanism for linking project translations together, however, and that could be a good place to start.
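As a starting point for the translation side, schema.org's `workTranslation` and `translationOfWork` properties could link the language versions of a page to each other. A small sketch, assuming one language is treated as the source and that we know the URL of the page in each language (the function name and URLs are hypothetical):

```python
import json


def translation_graph(urls_by_language, source_language="en"):
    """Return a JSON-LD @graph linking a source page and its translations."""
    source_url = urls_by_language[source_language]
    nodes = []
    for language, url in sorted(urls_by_language.items()):
        node = {"@type": "TechArticle", "@id": url, "inLanguage": language}
        if language == source_language:
            # The source page lists every translation made from it.
            others = [
                {"@id": other_url}
                for other_language, other_url in sorted(urls_by_language.items())
                if other_language != source_language
            ]
            if others:
                node["workTranslation"] = others
        else:
            # Each translation points back to the page it was translated from.
            node["translationOfWork"] = {"@id": source_url}
        nodes.append(node)
    return {"@context": "https://schema.org", "@graph": nodes}


if __name__ == "__main__":
    graph = translation_graph(
        {
            "en": "https://docs.example.com/en/latest/install.html",
            "es": "https://docs.example.com/es/latest/install.html",
        }
    )
    print(json.dumps(graph, indent=2))
```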
Do we know all this data when serving the page? If so, we can implement this feature in a simple and generic way via a CF worker and inject this HTML tag at the CDN.
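The injection step itself is small regardless of where it runs. A real CF worker would be written in JavaScript, so the Python below is only meant to show the transformation: insert a prepared snippet just before the closing `</head>` and leave pages without a head untouched. The function name is hypothetical.

```python
import re

# Matches the closing head tag regardless of case or stray whitespace.
_HEAD_CLOSE = re.compile(r"</head\s*>", re.IGNORECASE)


def inject_into_head(html, snippet):
    """Insert snippet right before the first </head>.

    Returns the HTML unchanged when no closing head tag is found.
    """
    match = _HEAD_CLOSE.search(html)
    if match is None:
        return html
    return html[: match.start()] + snippet + "\n" + html[match.start():]


if __name__ == "__main__":
    page = "<html><head><title>Docs</title></head><body></body></html>"
    tag = '<script type="application/ld+json">{"@type": "TechArticle"}</script>'
    print(inject_into_head(page, tag))
```

Because this works on the served HTML, it does not depend on Sphinx or any other documentation tool.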