portaljs Google Dataset search fields added for dataset pages (automatically)

Add structured data fields to DataHub dataset pages so they get indexed better by Google.

See here for instructions https://developers.google.com/search/docs/appearance/structured-data/dataset

Should be pretty simple to do from the metadata we already have for datasets ...

### Tasks
- [ ] Shape this piece of work e.g. research what fields to add, how we could add them ⏲️2h
- [ ] TODO: add implementation steps ...

Jun 29 '24 12:06 rufuspollock

I think the value of this is very high and I suspect the cost of doing it is very low: we just need to add some fields to the HTML `<head>`.

Jun 29 '24 12:06 rufuspollock

Situation

Enhancing dataset page indexing in Google Search is crucial for improving visibility and accessibility of our content.

Problem

Currently, our dataset pages lack structured data fields required for optimal indexing according to schema.org standards.

Solution

Implement structured data fields using JSON-LD to provide search engines with detailed metadata about our datasets.

Appetite

Implementation of JSON-LD structured data should be completed within 2-3 days, including testing and adjustments.

Rabbit-holes

- Ensuring all required fields are correctly populated in JSON-LD.
- Testing and validating the impact on search rankings may require monitoring with Google Search Console.
- Handling potential discrepancies between schema.org guidelines and actual search engine algorithms.

No-gos

Avoid implementing incomplete or incorrect JSON-LD structures that could potentially harm search engine indexing.
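One way to guard against shipping incomplete JSON-LD is a small check that runs before the script is rendered. The sketch below is only an illustration: `findDatasetJsonLdProblems` is a hypothetical helper, not something that exists in PortalJS, and it assumes the rules in the Google documentation linked above, which lists `name` and `description` as required and asks for a description of 50 to 5000 characters.

```typescript
// Hypothetical helper (not part of PortalJS): returns a list of problems
// found in a Dataset JSON-LD object, or an empty list if none are found.
type JsonLd = Record<string, unknown>;

function findDatasetJsonLdProblems(jsonLd: JsonLd): string[] {
  const problems: string[] = [];

  if (jsonLd["@type"] !== "Dataset") {
    problems.push('"@type" must be "Dataset"');
  }

  // Assumption: "name" is required by Google's Dataset guidelines.
  const name = jsonLd["name"];
  if (typeof name !== "string" || name.trim() === "") {
    problems.push('"name" is required');
  }

  // Assumption: "description" is required and must be 50 to 5000 characters.
  const description = jsonLd["description"];
  if (typeof description !== "string") {
    problems.push('"description" is required');
  } else if (description.length < 50 || description.length > 5000) {
    problems.push('"description" should be 50 to 5000 characters long');
  }

  return problems;
}
```

If the returned list is non-empty, the page could skip rendering the script altogether rather than emit a partial structure.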

Appendix

Example JSON-LD script and suggestions for testing on specific dataset pages like Air Pollution Collection. Regular monitoring through Google Search Console recommended for evaluating effectiveness.

Jul 02 '24 10:07 gradedSystem

@gradedSystem

Can you create a draft of the JSON-LD that specifies exactly which fields we'd include, and which part of the Data Package each would come from? Something like:

{
  ...
  name: datapackage.title,
  description: datapackage.description,
  license: datapackage.licenses[0],
  ...
}

Jul 04 '24 13:07 olayway

This may also be helpful when it comes to implementation: https://nextjs.org/docs/app/building-your-application/optimizing/metadata#json-ld
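As a rough sketch of the approach described on that page, the JSON-LD object is serialized and placed inside a `<script type="application/ld+json">` tag. The function name `renderJsonLdScript` is made up for illustration. In a Next.js component the serialized string would more likely be passed through `dangerouslySetInnerHTML` than concatenated by hand, but the escaping step is the same.

```typescript
// Hypothetical helper: serializes a JSON-LD object into a script tag string.
function renderJsonLdScript(jsonLd: Record<string, unknown>): string {
  // Escape "<" so that a metadata value containing "</script>" cannot
  // close the tag early. "\u003c" is still valid JSON for "<".
  const payload = JSON.stringify(jsonLd).replace(/</g, "\\u003c");
  return `<script type="application/ld+json">${payload}</script>`;
}

// Usage with placeholder values (not real dataset metadata).
const exampleTag = renderJsonLdScript({
  "@context": "https://schema.org/",
  "@type": "Dataset",
  name: "Example with a </script> in the title",
});
```

The escaping matters because titles and descriptions come from user-supplied Data Package metadata.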

Jul 04 '24 13:07 olayway

Here is the JSON-LD format, in which I tried to incorporate everything from the metadata that is available here: https://specs.frictionlessdata.io/data-package/#metadata

<script type="application/ld+json">
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "description": "datapackage.description",
  "name": "datapackage.title",
  "alternateName": "datapackage.name",
  "url": "datapackage.homepage",
  "identifier": [
    "datapackage.id[0]",
    "datapackage.id[1]",
    ...
  ],
  "isAccessibleForFree": true,
  "license": [
    {
      "@type": "CreativeWork",
      "name": "datapackage.licenses[0].name",
      "url": "datapackage.licenses[0].path"
    },
    {
      "@type": "CreativeWork",
      "name": "datapackage.licenses[1].name",
      "url": "datapackage.licenses[1].path"
    },
    ...
  ],
  "creator": [
    {
      "@type": "Person",
      "url": "datapackage.contributors[0].path",
      "name": "datapackage.contributors[0].title",
      "affiliation": {
        "@type": "Organization",
        "name": "datapackage.contributors[0].organization"
      },
      "contactPoint": {
        "@type": "ContactPoint",
        "email": "datapackage.contributors[0].email"
      }
    },
    {
      "@type": "Person",
      "url": "datapackage.contributors[1].path",
      "name": "datapackage.contributors[1].title",
      "affiliation": {
        "@type": "Organization",
        "name": "datapackage.contributors[1].organization"
      },
      "contactPoint": {
        "@type": "ContactPoint",
        "email": "datapackage.contributors[1].email"
      }
    },
    ...
  ],
  "isPartOf": [
    "datapackage.sources[0].path",
    "datapackage.sources[1].path",
    ...
  ],
  "dateCreated": "datapackage.created",
  "dateModified": "datapackage.updated",
  "citation": "datapackage.id",
  "version": "datapackage.version"
}
</script>
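
To make the mapping above concrete, here is a minimal sketch of how it might look as a function. Everything in it is an assumption for illustration: the `DataPackage` interface is a hand-picked subset of the Frictionless descriptor, `toDatasetJsonLd` is a hypothetical name, and treating every contributor as a `Person` is a simplification. It covers only a few of the fields from the draft.

```typescript
// Assumed subset of the Data Package descriptor (illustrative only).
interface DataPackage {
  name: string;
  title?: string;
  description?: string;
  homepage?: string;
  version?: string;
  created?: string;
  licenses?: { name?: string; title?: string; path?: string }[];
  contributors?: { title: string; path?: string; email?: string }[];
}

// Hypothetical mapper from a Data Package descriptor to Dataset JSON-LD.
function toDatasetJsonLd(dp: DataPackage): Record<string, unknown> {
  const jsonLd: Record<string, unknown> = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    // Prefer the human-readable title; fall back to the slug-like name.
    name: dp.title ?? dp.name,
    alternateName: dp.title ? dp.name : undefined,
    description: dp.description,
    url: dp.homepage,
    version: dp.version,
    dateCreated: dp.created,
    license: dp.licenses?.map((l) => ({
      "@type": "CreativeWork",
      name: l.title ?? l.name,
      url: l.path,
    })),
    creator: dp.contributors?.map((c) => ({
      "@type": "Person", // assumption: contributors are people
      name: c.title,
      url: c.path,
      email: c.email,
    })),
  };

  // Drop top-level keys with no value so absent metadata is simply omitted.
  // Nested undefined values are dropped later by JSON.stringify.
  const cleaned: Record<string, unknown> = {};
  for (const key of Object.keys(jsonLd)) {
    if (jsonLd[key] !== undefined) cleaned[key] = jsonLd[key];
  }
  return cleaned;
}
```

Building the object from the descriptor at render time keeps the JSON-LD in sync with whatever metadata the dataset page already loads.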

cc @olayway

Jul 10 '24 12:07 gradedSystem

The only question I have is whether we can also use other fields listed in schema.org. I'll try to find out. But I think we're good to go.

Jul 10 '24 14:07 olayway

@gradedSystem what's the status of this?

Jul 22 '24 11:07 olayway

The script is being successfully added to the HTML:

But when testing any of our core site URLs, it seems they can't even be accessed:

https://validator.schema.org/#url=https%3A%2F%2Fdatahub.io%2Fcore%2Ffinance-vix
https://search.google.com/test/rich-results/result?id=Deb_u6CqnsguXS0RHHn2mA

This is because our dataset pages still return a 500 initially. It's an old issue that we thought was fixed (or rather, for which we had found a workaround): https://github.com/datopian/datahub-next/issues/275

FIXED. I will open a new issue for the 500 errors.

Aug 09 '24 12:08 olayway