Can data packages be made easily findable using search engines?

Open HeidiSeibold opened this issue 5 years ago • 7 comments

It seems to me that with schema.org and this there is currently a movement toward making data sets more easily findable with search engines.

How do these efforts relate to data packages?

Side note: I have no technical knowledge about search engines nor do I really understand what schema.org does. I am just a researcher who wants to make her data sets findable :woman_scientist:

HeidiSeibold avatar Nov 02 '18 08:11 HeidiSeibold

@HeidiSeibold great to have you flag this; there has been a fair amount of discussion. The basic method to make data packages more discoverable is to add the relevant meta tags or other info to the HTML page where the data package is catalogued. This is already supported in e.g. https://datahub.io and is also supported automatically in most CKAN-based data portals.

In terms of the data package specs I think we could definitely publish a pattern that suggests a standard mapping for data package metadata to the tags you can add to your html page.

rufuspollock avatar Nov 05 '18 22:11 rufuspollock

For Google, specifically the recently announced Google Dataset Search, the relevant documentation is https://developers.google.com/search/docs/data-types/dataset which documents the specific ways in which Google understands schema.org and W3C DCAT for dataset discovery (including use of sitemap files, canonical URLs for de-duplication, etc.).

  • Google Dataset Search: https://toolbox.google.com/datasetsearch
  • Announcement: https://www.blog.google/products/search/making-it-easier-discover-datasets/
  • Earlier blog post: https://ai.googleblog.com/2017/01/facilitating-discovery-of-public.html
  • Google structured data testing tool: https://developers.google.com/structured-data/testing-tool/

In brief, we can extract simple dataset descriptions that use http://schema.org/Dataset, or a similar structure using DCAT, from dataset-describing pages that use any of 1) JSON-LD in a script tag, 2) RDFa 1.1, or 3) Microdata syntaxes. For dataset descriptions that use these notations, like W3C Data Cube, W3C CSVW, we're looking into direct support. For other formats and approaches it may be useful to collaborate on mappings.

Looking at https://datahub.io/JohnSnowLabs/uk-greater-london-public-expenditures I don't see the markup directly appearing. What codebase runs datahub.io, is it different to CKAN? My understanding is that general CKAN now has support for the schema.org markup and/or DCAT, either directly in latest version or via the DCAT extension.
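
For anyone who wants to check a page for this markup themselves, here is a minimal sketch using only the Python standard library. It covers just the JSON-LD-in-a-script-tag case (not RDFa or Microdata), does not look inside `@graph` wrappers, and the names `JsonLdCollector` and `find_datasets` are made up for illustration. It is not a substitute for Google's testing tool.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the parsed contents of every JSON-LD script tag in a page."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip blocks that are not valid JSON


def find_datasets(html):
    """Return the top-level JSON-LD objects whose @type includes 'Dataset'."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    found = []
    for block in collector.blocks:
        items = block if isinstance(block, list) else [block]
        for item in items:
            if not isinstance(item, dict):
                continue
            types = item.get("@type", [])
            if isinstance(types, str):
                types = [types]
            if "Dataset" in types:
                found.append(item)
    return found
```

Feed it the HTML of a dataset page (fetched however you like); an empty list means no top-level `Dataset` JSON-LD was found in the served HTML, though markup injected later by JavaScript would not show up this way.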

Nearby:

  • https://github.com/ckan/ckanext-dcat/issues/75
  • https://github.com/ckan/ideas-and-roadmap/issues/220a
  • https://github.com/ckan/ckanext-dcat/issues/137

danbri avatar Nov 27 '18 18:11 danbri

/cc @serahrono who I just met at Wikicite conference :)

danbri avatar Nov 27 '18 18:11 danbri

/cc @amercader @metaodi

https://ckan.org/2018/04/30/make-open-data-discoverable-for-search-engines/

danbri avatar Nov 27 '18 19:11 danbri

@HeidiSeibold 👋

As @danbri already mentioned, if you use CKAN and the latest ckanext-dcat extension, you're set up to feed Google (and whoever else supports schema.org/Dataset) and appear in search results (iirc currently limited to the above linked "Dataset Search").

@rufuspollock does the support in datahub.io imply that such a mapping already exists?

Anyway: since CKAN can handle both data package and schema.org, it should be fairly easy to extract.

metaodi avatar Nov 27 '18 20:11 metaodi

Give me a shout (maybe a Twitter ping, I'm 'danbri' there) if I can help on this, in case I miss the GitHub msgs in the noise...

danbri avatar Dec 07 '18 19:12 danbri

@danbri @metaodi DataHub.io does not run CKAN, but we do set the attributes that @danbri mentions so that these datasets automatically show up in Google Dataset Search :smile:

It would also be useful to produce a published "pattern" on https://frictionlessdata.io/specs/patterns/ that maps Data Package metadata to the structure needed for Google
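
As a rough starting point for such a pattern, here is a sketch of one possible mapping from a `datapackage.json` descriptor to a schema.org `Dataset` in JSON-LD. The field correspondences (`title` to `name`, first `licenses` entry to `license`, `resources` to `distribution`) are assumptions for illustration, not an agreed mapping, and the function names and `base_url` parameter are hypothetical.

```python
import json
from urllib.parse import urljoin


def datapackage_to_jsonld(descriptor, base_url=""):
    """Map a Data Package descriptor (dict) to a schema.org Dataset (dict)."""
    dataset = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        # Assumption: prefer the human-readable title over the slug-like name.
        "name": descriptor.get("title") or descriptor.get("name"),
    }
    # These keys are copied across unchanged when present.
    for key in ("description", "keywords", "version"):
        if key in descriptor:
            dataset[key] = descriptor[key]
    if "homepage" in descriptor:
        dataset["url"] = descriptor["homepage"]
    # Assumption: only the first license is mapped, as a URL if available.
    licenses = descriptor.get("licenses") or []
    if licenses:
        first = licenses[0]
        dataset["license"] = first.get("path") or first.get("name")
    # Each resource path becomes one DataDownload entry.
    distribution = []
    for resource in descriptor.get("resources", []):
        paths = resource.get("path", [])
        if isinstance(paths, str):
            paths = [paths]
        for path in paths:
            entry = {"@type": "DataDownload", "contentUrl": urljoin(base_url, path)}
            if "format" in resource:
                entry["encodingFormat"] = resource["format"]
            distribution.append(entry)
    if distribution:
        dataset["distribution"] = distribution
    return dataset


def to_script_tag(dataset):
    """Serialize the Dataset dict as a JSON-LD script tag for an HTML page."""
    # Escape "</" so the JSON cannot close the script element early.
    payload = json.dumps(dataset, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"
```

Calling `to_script_tag(datapackage_to_jsonld(descriptor, base_url=...))` gives a snippet to paste into the page that catalogues the package. `base_url` is there because resource paths in a descriptor are often relative, while `contentUrl` should be an absolute URL.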

rufuspollock avatar Jan 06 '19 14:01 rufuspollock