
Structured Metadata for Search & SEO

Open davidfischer opened this issue 6 years ago • 10 comments

We could improve the SEO of Read the Docs by using structured metadata. Here's Google's documentation on the subject. Basically, this involves adding special tags (or JSON) to parts of our site so that search engines get a deeper understanding of its content.

For example, we could add the following to the output of the documentation for the Read the Docs Sphinx theme or to its project page:

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "SoftwareApplication",
  "name": "Read the Docs Sphinx Theme",
  "description": "The sphinx_rtd_theme is a sphinx theme designed to look modern and be mobile-friendly.",
  "keywords": "sphinx, python, readthedocs",
  "softwareVersion": "0.4.2",
  "softwareHelp": "https://sphinx-rtd-theme.readthedocs.io/en/latest/",
  "operatingSystem": "Windows, Mac, Linux",
  "applicationCategory": "DeveloperApplication",
  "inLanguage": "en",
  "license": "https://opensource.org/licenses/MIT",
  "datePublished": "2018-12-31",
  "url": "https://github.com/rtfd/sphinx_rtd_theme"
}
</script>

See the schema.org docs for "Software Application" for all possible attributes.

GitHub itself uses these tags. If you view source on the readthedocs.org page, you'll notice references to schema.org; these are structured metadata.

You can test this metadata in Google's tooling.

davidfischer avatar Jan 31 '19 21:01 davidfischer

Google also has docs specifically for marking up software apps

davidfischer avatar Jan 31 '19 21:01 davidfischer

A while ago I was able to extract some similar information from projects; we can use the same code for this: https://github.com/rtfd/readthedocs.org/issues/1758#issuecomment-439250406. This probably fits better in the Sphinx extension?

stsewd avatar Jan 31 '19 22:01 stsewd

This would be a great addition. I think we'd have to output context data from RTD and pick that up in our Sphinx extension. However, we might already have all the metadata and context data we need available in Sphinx. I think the bulk of the work will be in the Sphinx extension, injecting this into the HTML output regardless of theme.

agjohnson avatar Feb 01 '19 19:02 agjohnson

I came across this issue again today. A user asked how to set the canonical version for SEO purposes, which is a great question. I also realized this applies to translations as well.

Google's guidance on translations is here: https://developers.google.com/search/docs/specialty/international/localized-versions
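That guidance centers on hreflang annotations, where every language version of a page lists itself and all of its siblings. A rough sketch of generating those tags, with a hypothetical helper name and placeholder docs.example.com URLs:

```python
from html import escape


def hreflang_links(translations, default_language=None):
    """Build <link rel="alternate" hreflang=...> tags for one page.

    translations maps a language code to the URL of that translation and
    should include the current page's own language.
    """
    template = '<link rel="alternate" hreflang="%s" href="%s" />'
    tags = [
        template % (escape(language, quote=True), escape(url, quote=True))
        for language, url in sorted(translations.items())
    ]
    if default_language in translations:
        # x-default marks the fallback for readers whose language is not listed.
        fallback = escape(translations[default_language], quote=True)
        tags.append(template % ("x-default", fallback))
    return "\n".join(tags)


print(hreflang_links(
    {
        "en": "https://docs.example.com/en/latest/",
        "es": "https://docs.example.com/es/latest/",
    },
    default_language="en",
))
```

The same set of tags would have to be emitted on every translation of the page, since the annotations are expected to be reciprocal.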

This does feel like it should be a core RTD feature, given our focus is enabling multiple versions and translations. Given recent conversations, though, perhaps this should be implemented outside Sphinx.

agjohnson avatar Nov 11 '22 17:11 agjohnson

I took a quick stab at this for our own docs, but it wasn't actually clear how to relate multiple versions of the same page together. As far as I can tell, this is not part of the SoftwareApplication schema type. There is a way to define translation relationships, but not for versions.

That's at least what I gather from https://github.com/schemaorg/schemaorg/issues/1476

From https://github.com/schemaorg/schemaorg/issues/975#issuecomment-671190715, it seems isPartOf could be used?

agjohnson avatar Nov 11 '22 19:11 agjohnson

I do see different tasks here:

  • JSON metadata on web application (.org/.com): this can be done by statically adding this data in the base.html Django template
  • JSON metadata on documentation pages (.io): this could be done in a Sphinx extension that users can decide whether or not to install (similar to what we did with sphinx-notfound-page)
  • Canonical version: looks like an application feature similar to the canonical URL but including the "canonical version" on it as well instead of pointing to the root of the domain
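A minimal sketch of the "canonical version" idea in the last bullet, assuming documentation URLs follow the /<language>/<version>/<page> layout seen in https://sphinx-rtd-theme.readthedocs.io/en/latest/. The function name, the docs.example.com URL, and the choice of "stable" as the canonical version are illustrative assumptions.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_link(page_url, canonical_version):
    """Point a versioned page at the same page in the canonical version."""
    parts = urlsplit(page_url)
    segments = parts.path.split("/")
    # Assumed path layout: ["", language, version, *page_path]
    if len(segments) < 3 or not segments[2]:
        raise ValueError("expected a /<language>/<version>/ path: %r" % page_url)
    segments[2] = canonical_version
    # Query string and fragment are dropped from the canonical URL.
    canonical = urlunsplit((parts.scheme, parts.netloc, "/".join(segments), "", ""))
    return '<link rel="canonical" href="%s" />' % escape(canonical, quote=True)


print(canonical_link("https://docs.example.com/en/1.2/install.html", "stable"))
```

This does not check that the page actually exists in the canonical version; a real implementation would need that lookup before emitting the tag.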

humitos avatar Nov 14 '22 08:11 humitos

> this could be done in a Sphinx extension that users can decide whether or not to install

This feels like more of a core feature, not something that should be optional or only supported in Sphinx. With the work we're describing around generalizing all of the Sphinx extensions we've authored, I'm not sure I'd start with a Sphinx extension for new feature tests when we have the option of making it an agnostic post-processing step instead.

> JSON metadata on web application (.org/.com)

I wasn't considering this, what exactly is the use case you see here?

> Canonical version

I think we're describing addressing documentation versioning SEO with schema metadata, not a separate feature. Google, in theory, uses this metadata for SEO purposes, though they don't say specifically what they do with multiple versions of the same documentation.

agjohnson avatar Nov 14 '22 18:11 agjohnson

@agjohnson

> This feels like more of a core feature, not something that should be optional or only supported in Sphinx. With the work we're describing around generalizing all of the Sphinx extensions we've authored, I'm not sure I'd start with a Sphinx extension for new feature tests when we have the option of making it an agnostic post-processing step instead.

We don't know all the information required to construct the JSON that David described. How are you considering gathering all this information?

humitos avatar Nov 15 '22 11:11 humitos

In that example, not all of the attributes are required. What I'm mostly interested in is building up the graph of documentation projects/versions/translations linking to each other. Right now, versions and translations might be considered duplicate content to Google, and this could be negatively affecting SEO for projects.

The (big?) hang-up is that the current schema does not offer an explicit way to define the version relationships between pages/projects. This is where the isPartOf attribute might be needed. The schema does have a mechanism for linking project translations together, however, and that could be a good place to start.
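To make that concrete, here is a hedged sketch of the JSON-LD for one page in the source language. It uses the schema.org properties workTranslation (the inverse, translationOfWork, would go on the translated pages) and version, plus isPartOf as the speculative link from a versioned page to its project. Whether search engines interpret isPartOf this way is an open question in this thread, and the function name and docs.example.com URLs are made up for illustration.

```python
import json


def page_graph(url, language, version, project_url, translations):
    """Describe one documentation page and its relations as JSON-LD."""
    return {
        "@context": "http://schema.org",
        "@type": "WebPage",
        "@id": url,
        "url": url,
        "inLanguage": language,
        "version": version,
        # Assumption: isPartOf ties every versioned page to one project node.
        "isPartOf": {"@type": "CreativeWork", "@id": project_url},
        # Translations of this page, keyed by language code in the input.
        "workTranslation": [
            {"@type": "WebPage", "@id": t_url, "inLanguage": t_lang}
            for t_lang, t_url in sorted(translations.items())
        ],
    }


graph = page_graph(
    url="https://docs.example.com/en/2.0/",
    language="en",
    version="2.0",
    project_url="https://docs.example.com/",
    translations={"es": "https://docs.example.com/es/2.0/"},
)
print(json.dumps(graph, indent=2))
```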

agjohnson avatar Nov 15 '22 17:11 agjohnson

Do we know all this data when serving the page? If so, we can implement this feature in a simple and generic way via a CF worker and inject this HTML tag at the CDN.
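The injection step itself is small. The sketch below shows the logic in Python purely for illustration; a worker running at the CDN would be written in that platform's own language and API, and the function name here is hypothetical. It inserts a prepared snippet right before the first closing head tag and leaves pages without one untouched.

```python
import re

# Matches the closing head tag regardless of case or inner whitespace.
_HEAD_CLOSE = re.compile(r"</head\s*>", re.IGNORECASE)


def inject_into_head(html_text, snippet):
    """Insert snippet immediately before the first </head> tag.

    Returns the input unchanged when no closing head tag is found.
    """
    match = _HEAD_CLOSE.search(html_text)
    if match is None:
        return html_text
    return html_text[: match.start()] + snippet + "\n" + html_text[match.start():]


page = "<html><head><title>Docs</title></head><body></body></html>"
print(inject_into_head(page, '<link rel="canonical" href="https://docs.example.com/en/stable/" />'))
```

Because it only needs the served HTML and a precomputed snippet, this approach works the same for any documentation tool.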

humitos avatar Nov 07 '23 15:11 humitos