science-on-schema.org
Additional hints in sitemaps to support efficient harvesting
Some collections may have large numbers of records describing different kinds of information (e.g. Datasets, Awards, and People) that may each have landing pages, and each landing page may have an entry in the sitemap.
An indexer interested only in Datasets would need to inspect every entry advertised in the sitemap to find the Dataset entries, which is inefficient and wastes resources.
Sitemaps are extensible, and one option is to provide type hints in the <url> section of the sitemap. For example:
<?xml version="1.0" encoding="UTF-8"?>
<urlset
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
>
  <url>
    <loc>https://arcticdata.io/catalog/view/doi%3A10.18739%2FA2ST7DZ2Q</loc>
    <lastmod>2021-12-07T12:15:05Z</lastmod>
    <rdf:type>http://schema.org/Dataset</rdf:type>
  </url>
</urlset>
An obvious challenge is that a single landing page may express many types, so which one should the hint specify? This is up to the provider: if there is a clear intention to present a specific type at the referenced <loc>, a hint can be provided, and consumers may use such hints.
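As a rough illustration of how a consumer might use such hints, the following Python sketch parses a sitemap and separates entries whose `rdf:type` hint matches a wanted type from entries that carry no hint at all (and would still need to be inspected). The function name `split_by_type_hint`, the `example.org` URLs, and the exact-string comparison of type IRIs are assumptions made for this example, not part of any published guidance.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Sample sitemap: the first entry is the one from the example above;
# the two example.org entries are hypothetical and only for illustration.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <url>
    <loc>https://arcticdata.io/catalog/view/doi%3A10.18739%2FA2ST7DZ2Q</loc>
    <lastmod>2021-12-07T12:15:05Z</lastmod>
    <rdf:type>http://schema.org/Dataset</rdf:type>
  </url>
  <url>
    <loc>https://example.org/people/1</loc>
    <rdf:type>http://schema.org/Person</rdf:type>
  </url>
  <url>
    <loc>https://example.org/unknown/2</loc>
  </url>
</urlset>"""


def split_by_type_hint(sitemap_xml, wanted_type):
    """Return (matched, unhinted) lists of <loc> values.

    matched:  entries with an rdf:type hint equal to wanted_type.
    unhinted: entries with no rdf:type hint, which a consumer would
              still have to fetch to learn what they describe.
    """
    root = ET.fromstring(sitemap_xml)
    matched, unhinted = [], []
    for url in root.iter(f"{{{SITEMAP_NS}}}url"):
        loc = (url.findtext(f"{{{SITEMAP_NS}}}loc") or "").strip()
        if not loc:
            continue  # skip malformed entries without a location
        # A provider may give more than one hint per entry.
        hints = {(el.text or "").strip() for el in url.findall(f"{{{RDF_NS}}}type")}
        if not hints:
            unhinted.append(loc)
        elif wanted_type in hints:
            matched.append(loc)
    return matched, unhinted


matched, unhinted = split_by_type_hint(SAMPLE, "http://schema.org/Dataset")
```

Here `matched` holds only the Dataset landing page and `unhinted` holds the entry without a hint, so the Person page is never fetched. A real harvester would probably also want to treat `http://schema.org/` and `https://schema.org/` IRIs as equivalent; this sketch compares strings exactly.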