Google Dataset search fields added for dataset pages (automatically)

Open rufuspollock opened this issue 1 year ago • 2 comments

Add some special fields to DataHub dataset pages so they get indexed better by Google.

See the instructions here: https://developers.google.com/search/docs/appearance/structured-data/dataset

Should be pretty simple to do from the metadata we already have for datasets ...

### Tasks
- [ ] Shape this piece of work e.g. research what fields to add, how we could add them ⏲️2h
- [ ] TODO: add implementation steps ...

rufuspollock avatar Jun 29 '24 12:06 rufuspollock

I think the value of this is very high and I suspect the effort of doing it is very low: we just need to add some fields to the HTML <head>.

rufuspollock avatar Jun 29 '24 12:06 rufuspollock

Situation

Enhancing dataset page indexing in Google Search is crucial for improving visibility and accessibility of our content.

Problem

Currently, our dataset pages lack structured data fields required for optimal indexing according to schema.org standards.

Solution

Implement structured data fields using JSON-LD to provide search engines with detailed metadata about our datasets.

Appetite

Implementation of JSON-LD structured data should be completed within 2-3 days, including testing and adjustments.

Rabbit-holes

  • Ensuring all required fields are correctly populated in JSON-LD.
  • Testing and validating the impact on search rankings may require ongoing monitoring in Google Search Console.
  • Handling potential discrepancies between schema.org guidelines and actual search engine algorithms.
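The first rabbit-hole (required fields being populated) can be guarded against with a small pre-render check. The following is a minimal sketch, not an existing PortalJS utility: the function name `findJsonLdProblems` is hypothetical, and the 50 to 5000 character bound on `description` is an assumption based on a reading of Google's Dataset guidelines that should be re-checked against the current documentation.

```typescript
// Hypothetical helper; not part of PortalJS.
type JsonLd = Record<string, unknown>;

// Returns a list of human-readable problems; an empty list means the
// object passes these basic checks.
function findJsonLdProblems(jsonLd: JsonLd): string[] {
  const problems: string[] = [];

  if (jsonLd["@type"] !== "Dataset") {
    problems.push('"@type" must be "Dataset"');
  }

  // Assumption: "name" and "description" are the required properties.
  for (const field of ["name", "description"]) {
    const value = jsonLd[field];
    if (typeof value !== "string" || value.trim() === "") {
      problems.push(`"${field}" is missing or empty`);
    }
  }

  // Assumption: description length should fall within 50..5000 characters.
  const description = jsonLd["description"];
  if (
    typeof description === "string" &&
    description.trim() !== "" &&
    (description.length < 50 || description.length > 5000)
  ) {
    problems.push('"description" should be between 50 and 5000 characters');
  }

  return problems;
}
```

A page could skip emitting the script tag whenever the returned list is non-empty, which also covers the "no incomplete JSON-LD" constraint.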

No-gos

Avoid implementing incomplete or incorrect JSON-LD structures that could potentially harm search engine indexing.

Appendix

An example JSON-LD script and suggestions for testing on specific dataset pages, such as the Air Pollution Collection. Regular monitoring through Google Search Console is recommended for evaluating effectiveness.

Image

gradedSystem avatar Jul 02 '24 10:07 gradedSystem

@gradedSystem

Can you create a draft of the JSON-LD that specifies exactly which fields we'd include and which part of the Data Package each one would come from? Something like:

{
  ...
  name: datapackage.title,
  description: datapackage.description,
  license: datapackage.licenses[0],
  ...
}

olayway avatar Jul 04 '24 13:07 olayway

This may also be helpful when it comes to implementation: https://nextjs.org/docs/app/building-your-application/optimizing/metadata#json-ld
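The approach in that guide boils down to serializing the JSON-LD object into a `<script type="application/ld+json">` tag. A framework-free sketch of the serialization step is below; the function name `jsonLdScriptTag` is hypothetical, and escaping `<` is a defensive measure so that a metadata value containing `</script>` cannot terminate the tag early.

```typescript
// Hypothetical helper: builds the script tag as a string.
function jsonLdScriptTag(jsonLd: Record<string, unknown>): string {
  // Escape "<" so that user-supplied metadata cannot close the tag.
  // "\u003c" is still valid JSON and parses back to "<".
  const body = JSON.stringify(jsonLd).replace(/</g, "\\u003c");
  return `<script type="application/ld+json">${body}</script>`;
}
```

In a React or Next.js component, the same escaped string would typically be passed to the script element via `dangerouslySetInnerHTML` rather than concatenated into HTML by hand.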

olayway avatar Jul 04 '24 13:07 olayway

Here is the JSON-LD format, in which I tried to incorporate everything from the metadata that is available here: https://specs.frictionlessdata.io/data-package/#metadata

<script type="application/ld+json">
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "description": "datapackage.description",
  "name": "datapackage.name",
  "title": "datapackage.title",
  "url": "datapackage.homepage",
  "identifier": [
    "datapackage.id[0]",
    "datapackage.id[1]",
    ...
  ],
  "isAccessibleForFree": true,
  "license": [
    {
      "@type": "datapackage.licenses[0].title",
      "name": "datapackage.licenses[0].name",
      "url": "datapackage.licenses[0].path"
    },
    {
      "@type": "datapackage.licenses[1].title",
      "name": "datapackage.licenses[1].name",
      "url": "datapackage.licenses[1].path"
    },
    ...
  ],
  "creator": [
    {
      "@type": "datapackage.contributors[0].organization",
      "url": "datapackage.contributors[0].path",
      "name": "datapackage.contributors[0].title",
      "contactPoint": {
        "@type": "ContactPoint",
        "email": "datapackage.contributors[0].email"
      }
    },
    {
      "@type": "datapackage.contributors[1].organization",
      "url": "datapackage.contributors[1].path",
      "name": "datapackage.contributors[1].title",
      "contactPoint": {
        "@type": "ContactPoint",
        "email": "datapackage.contributors[1].email"
      }
    },
    ...
  ],
  "isPartOf": [
    "datapackage.sources[0].path",
    "datapackage.sources[1].path",
    ...
  ],
  "dateCreated": "datapackage.created",
  "dateModified": "datapackage.updated",
  "citation": "datapackage.id",
  "version": "datapackage.version"
}
</script>
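One way the template above could be turned into code is sketched below. It is an illustration under stated assumptions, not the implementation that was shipped: the names `DataPackage`, `compact`, and `toDatasetJsonLd` are hypothetical, and the mapping deliberately deviates from the draft in a few places. It uses fixed schema.org types (`CreativeWork`, `Person`) for `@type` instead of Data Package values, puts the short `name` in `alternateName` because `title` is not a schema.org Dataset property, and maps sources to `isBasedOn` rather than `isPartOf`.

```typescript
// Subset of Data Package descriptor fields used in this sketch.
interface DataPackage {
  name?: string;
  title?: string;
  description?: string;
  homepage?: string;
  version?: string;
  created?: string;
  licenses?: { name?: string; title?: string; path?: string }[];
  contributors?: { title: string; path?: string; email?: string }[];
  sources?: { title: string; path?: string }[];
}

// Drops undefined values and empty arrays so the output has no blank fields.
function compact(obj: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of Object.keys(obj)) {
    const value = obj[key];
    const isEmptyArray = Array.isArray(value) && value.length === 0;
    if (value !== undefined && !isEmptyArray) out[key] = value;
  }
  return out;
}

// Hypothetical mapping from a Data Package descriptor to schema.org Dataset.
function toDatasetJsonLd(dp: DataPackage): Record<string, unknown> {
  return compact({
    "@context": "https://schema.org/",
    "@type": "Dataset",
    name: dp.title ?? dp.name,
    alternateName: dp.title ? dp.name : undefined,
    description: dp.description,
    url: dp.homepage,
    version: dp.version,
    dateCreated: dp.created,
    license: (dp.licenses ?? []).map((l) =>
      compact({ "@type": "CreativeWork", name: l.title ?? l.name, url: l.path })
    ),
    // Assumption: contributors are treated as people, not organizations.
    creator: (dp.contributors ?? []).map((c) =>
      compact({ "@type": "Person", name: c.title, url: c.path, email: c.email })
    ),
    isBasedOn: (dp.sources ?? [])
      .map((s) => s.path)
      .filter((p): p is string => p !== undefined),
  });
}
```

Dropping empty values keeps the output aligned with the no-go above: a field that the descriptor does not provide is omitted rather than emitted as a placeholder.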

cc @olayway

gradedSystem avatar Jul 10 '24 12:07 gradedSystem

The only question I have is whether we can also use other fields listed on schema.org. I'll try to find out. But I think we're good to go.

olayway avatar Jul 10 '24 14:07 olayway

@gradedSystem what's the status of this?

olayway avatar Jul 22 '24 11:07 olayway

The script is being successfully added to the HTML:

image

But when testing any of our core site URLs, it seems the validators can't even access them:

  • https://validator.schema.org/#url=https%3A%2F%2Fdatahub.io%2Fcore%2Ffinance-vix
  • https://search.google.com/test/rich-results/result?id=Deb_u6CqnsguXS0RHHn2mA

This is because our dataset pages still return a 500 on the initial request. It is an old issue that we thought was fixed (or rather, for which we had found a workaround): https://github.com/datopian/datahub-next/issues/275
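For re-testing pages once they respond correctly, the validator link used above follows a simple pattern: the page URL is percent-encoded and appended after `#url=`. A minimal sketch, with hypothetical helper names `schemaValidatorUrl` and `isCrawlableStatus`:

```typescript
// Builds a validator.schema.org link for a given page URL.
function schemaValidatorUrl(pageUrl: string): string {
  return `https://validator.schema.org/#url=${encodeURIComponent(pageUrl)}`;
}

// A validator or crawler can only read the JSON-LD if the page itself
// responds with a success status; a 500 means there is nothing to parse.
function isCrawlableStatus(status: number): boolean {
  return status >= 200 && status < 300;
}
```

Checking the response status of the first request to a page before opening the validator link helps separate "the JSON-LD is wrong" from "the page could not be fetched".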

FIXED. I will open a new issue for the 500 errors.

olayway avatar Aug 09 '24 12:08 olayway