Google Dataset search fields added for dataset pages (automatically)
Add some special fields to DataHub dataset pages so they get indexed better by Google.
See here for instructions https://developers.google.com/search/docs/appearance/structured-data/dataset
Should be pretty simple to do from the metadata we already have for datasets ...
### Tasks
- [ ] Shape this piece of work e.g. research what fields to add, how we could add them ⏲️2h
- [ ] TODO: add implementation steps ...
I think the value of this is very high and I suspect the effort is very low - we just need to add some fields to the HTML `<head>`.
Situation
Enhancing dataset page indexing in Google Search is crucial for improving visibility and accessibility of our content.
Problem
Currently, our dataset pages lack structured data fields required for optimal indexing according to schema.org standards.
Solution
Implement structured data fields using JSON-LD to provide search engines with detailed metadata about our datasets.
Appetite
Implementation of JSON-LD structured data should be completed within 2-3 days, including testing and adjustments.
Rabbit-holes
- Ensuring all required fields are correctly populated in JSON-LD.
- Testing and validating the impact on search rankings may require ongoing monitoring in Google Search Console.
- Handling potential discrepancies between schema.org guidelines and actual search engine algorithms.
No-gos
Avoid implementing incomplete or incorrect JSON-LD structures that could potentially harm search engine indexing.
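One way to guard against shipping incomplete JSON-LD is a small check that runs before the script is rendered. The sketch below is only an illustration: `DatasetJsonLd` and `findJsonLdProblems` are hypothetical names, and the rules encoded (a `Dataset` type, a non-empty `name`, a `description` of 50 to 5000 characters) reflect my reading of the Google guide linked above, so they should be re-checked against it.

```typescript
// Hypothetical shape: only the fields this check looks at are typed.
interface DatasetJsonLd {
  "@context"?: string;
  "@type"?: string;
  name?: string;
  description?: string;
  [key: string]: unknown;
}

// Returns a list of human-readable problems. An empty list means the
// object passes these basic checks; it does not guarantee rich results.
function findJsonLdProblems(jsonLd: DatasetJsonLd): string[] {
  const problems: string[] = [];

  if (jsonLd["@type"] !== "Dataset") {
    problems.push('"@type" should be "Dataset"');
  }

  const name = jsonLd.name?.trim() ?? "";
  if (name === "") {
    problems.push('"name" is missing or empty');
  }

  // Assumption: length limits taken from Google's Dataset guidelines.
  const description = jsonLd.description?.trim() ?? "";
  if (description === "") {
    problems.push('"description" is missing or empty');
  } else if (description.length < 50 || description.length > 5000) {
    problems.push('"description" should be 50 to 5000 characters long');
  }

  return problems;
}
```

If the list is non-empty, the page could skip the script entirely rather than emit a partial one.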
Appendix
Example JSON-LD script and suggestions for testing on specific dataset pages like Air Pollution Collection. Regular monitoring through Google Search Console is recommended for evaluating effectiveness.
@gradedSystem
Can you create a draft of a JSON-LD that specifies exactly which fields we'd include, and which part of the Data Package each would come from? Something like:
{
...
name: datapackage.title,
description: datapackage.description,
license: datapackage.licenses[0],
...
}
This may also be helpful when it comes to implementation: https://nextjs.org/docs/app/building-your-application/optimizing/metadata#json-ld
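For the rendering side, here is a minimal sketch of turning a JSON-LD object into the script tag. `jsonLdScriptTag` is a hypothetical helper, not something from our codebase. In a Next.js page, the same escaped string would presumably be passed to a `<script>` element via `dangerouslySetInnerHTML` instead of being concatenated by hand.

```typescript
// Serialise a JSON-LD object into a <script> tag for the page <head>.
// Escaping "<" keeps a value such as "</script>" inside a dataset
// description from closing the tag early.
function jsonLdScriptTag(jsonLd: Record<string, unknown>): string {
  const json = JSON.stringify(jsonLd).replace(/</g, "\\u003c");
  return `<script type="application/ld+json">${json}</script>`;
}
```

The `\u003c` escape is still valid JSON, so parsers read the original value back unchanged.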
Here is the JSON-LD format, in which I tried to incorporate everything from the metadata that is available here: https://specs.frictionlessdata.io/data-package/#metadata
<script type="application/ld+json">
{
"@context": "https://schema.org/",
"@type": "Dataset",
"description": "datapackage.description",
"name": "datapackage.name",
"title": "datapackage.title",
"url": "datapackage.homepage",
"identifier": [
"datapackage.id[0]",
"datapackage.id[1]",
...
],
"isAccessibleForFree": true,
"license": [
{
"@type": "datapackage.licenses[0].title",
"name": "datapackage.licenses[0].name",
"url": "datapackage.licenses[0].path"
},
{
"@type": "datapackage.licenses[1].title",
"name": "datapackage.licenses[1].name",
"url": "datapackage.licenses[1].path"
},
...
],
"creator": [
{
"@type": "datapackage.contributors[0].organization",
"url": "datapackage.contributors[0].path",
"name": "datapackage.contributors[0].title",
"contactPoint": {
"@type": "ContactPoint",
"email": "datapackage.contributors[0].email"
}
},
{
"@type": "datapackage.contributors[1].organization",
"url": "datapackage.contributors[1].path",
"name": "datapackage.contributors[1].title",
"contactPoint": {
"@type": "ContactPoint",
"email": "datapackage.contributors[1].email"
}
},
...
],
"isPartOf": [
"datapackage.sources[0].path",
"datapackage.sources[1].path",
...
],
"dateCreated": "datapackage.created",
"dateModified": "datapackage.updated",
"citation": "datapackage.id",
"version": "datapackage.version"
}
</script>
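To make the mapping concrete, here is a rough TypeScript sketch of building the object from a Data Package descriptor. It is not the implemented version: `DataPackage` and `toDatasetJsonLd` are hypothetical names, and only a subset of the draft's fields is covered. It also deviates from the draft in a few places, which are assumptions on my part: `@type` is set to a schema.org type name (`CreativeWork`, `Person`) rather than a value from the descriptor, and the human-readable `title` goes into `name`, with the slug in `alternateName`, since as far as I know schema.org's `Dataset` has no `title` property.

```typescript
// Hypothetical subset of the Frictionless Data Package descriptor.
interface DataPackage {
  name?: string;
  title?: string;
  description?: string;
  homepage?: string;
  version?: string;
  created?: string;
  licenses?: { name?: string; path?: string; title?: string }[];
  contributors?: { title: string; path?: string; email?: string }[];
}

// Map a descriptor to a schema.org Dataset object. Properties left as
// `undefined` are dropped by JSON.stringify when the tag is rendered.
function toDatasetJsonLd(dp: DataPackage): Record<string, unknown> {
  return {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    // Prefer the human-readable title; fall back to the slug.
    name: dp.title ?? dp.name,
    alternateName: dp.title ? dp.name : undefined,
    description: dp.description,
    url: dp.homepage,
    version: dp.version,
    dateCreated: dp.created,
    license: dp.licenses?.map((license) => ({
      "@type": "CreativeWork",
      name: license.title ?? license.name,
      url: license.path,
    })),
    creator: dp.contributors?.map((contributor) => ({
      // Assumption: contributors are people; use "Organization" where
      // the contributor is known to be one.
      "@type": "Person",
      name: contributor.title,
      url: contributor.path,
      email: contributor.email,
    })),
  };
}
```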
cc @olayway
The only question I have is whether we can also use other fields listed in schema.org. I'll try to find out. But I think we're good to go.
@gradedSystem what's the status of this?
The script is being successfully added to the HTML:
But when testing any of our core site URLs, it seems they can't even be accessed:
- https://validator.schema.org/#url=https%3A%2F%2Fdatahub.io%2Fcore%2Ffinance-vix
- https://search.google.com/test/rich-results/result?id=Deb_u6CqnsguXS0RHHn2mA
This is because our dataset pages still return 500 initially. It's an old issue that we thought was fixed (or rather, for which we had found a workaround): https://github.com/datopian/datahub-next/issues/275
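Until the validators can reach the pages, a local check on a fetched response can at least confirm two things: that the page did not return an error status, and that it carries parseable JSON-LD. This is a rough sketch. `extractJsonLd` and `checkDatasetPage` are hypothetical helpers, fetching the page is left to the caller, and the regex is a shortcut that assumes our own markup rather than a general HTML parser.

```typescript
// Pull every JSON-LD block out of an HTML string and parse it.
// JSON.parse throws if a block is not valid JSON.
function extractJsonLd(html: string): unknown[] {
  const pattern =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const blocks: unknown[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    const body = match[1];
    if (body !== undefined) {
      blocks.push(JSON.parse(body));
    }
  }
  return blocks;
}

// Fail loudly on an error status (such as the initial 500) or on a
// page that has no JSON-LD at all.
function checkDatasetPage(status: number, html: string): unknown[] {
  if (status < 200 || status >= 300) {
    throw new Error(`page returned HTTP ${status}`);
  }
  const blocks = extractJsonLd(html);
  if (blocks.length === 0) {
    throw new Error("no JSON-LD script found");
  }
  return blocks;
}
```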
Fixed. I will open a new issue for the 500 errors.