
default robots.txt to control harvesting

Open · mbjones opened this issue 1 year ago · 3 comments

Describe the feature you'd like

Add a robots.txt file that is easily configured for production and test deployments.

For production, the file should generally restrict access to the package service, but allow everything else, and provide the sitemap link:

User-agent: *
Disallow: /metacat/d1/mn/v2/packages/
Sitemap: https://arcticdata.io/sitemap_index.xml

For testing, the file should restrict access to everything:

User-agent: *
Disallow: /

Is your feature request related to a problem? Please describe.

Duplicated test datasets show up in Google Dataset Search and flood the results, making it hard to find the real production datasets.

MetacatUI provides a searchable web interface which, in combination with a Metacat-provided sitemap.xml document, enables harvesters like Googlebot to index the site and all of its datasets. Generally we do not want those harvesters to index test deployments, since those typically contain bogus content. An example can be seen here:

  • https://datasetsearch.research.google.com/search?src=0&query=site%3Aarcticdata.io

Considerations

  • For some deployments, MetacatUI is not installed at the root of a web site and therefore does not control the root-level robots.txt. For example, for the ADC, MetacatUI is installed at https://arcticdata.io/catalog, so the robots.txt needs to go at the root.
  • Containerized Kubernetes deployments could probably be configured through Helm or other mechanisms to do the right thing conditionally, and allow for site overrides via values.yaml (see the sketch after this list).
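
As a rough illustration of that last point, a per-site override in values.yaml could look something like the following. This is only a sketch: the robots key and its sub-fields are hypothetical, not the chart's actual schema.

# Hypothetical values.yaml override for one site; key names are illustrative only.
robots:
  userAgent: "*"
  disallow:
    - /metacat/d1/mn/v2/packages/
  sitemap: https://arcticdata.io/sitemap_index.xml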

mbjones — Feb 16 '24 22:02

Manually deployed some robots.txt files, tracking in our deployments list here: https://docs.google.com/spreadsheets/d/1NtF8DAZCg6eGKGY66ca2nYi2ftmVVGR3-0IHlAZKZCI/edit#gid=0

mbjones — Feb 23 '24 02:02

Added to the Metacat Helm chart in PR 1893.

This adds a robots.txt file to the Metacat installation and adds a rewrite rule that redirects /robots.txt to its location.
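
One way such a redirect could be wired up in a Kubernetes deployment is an Ingress rule that rewrites /robots.txt to the path where the file is actually served. The sketch below only shows the general idea; the service name, port, and target path are assumptions, not necessarily what PR 1893 does.

# Hypothetical Ingress sketch: rewrite /robots.txt to the bundled file.
# Names and paths are assumptions, not the chart's real values.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: metacat-robots
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /metacat/robots.txt
spec:
  ingressClassName: nginx
  rules:
    - http:
        paths:
          - path: /robots.txt
            pathType: Exact
            backend:
              service:
                name: metacat
                port:
                  number: 8080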

If the Metacat property sitemap.enabled=false (the default for k8s Metacat deployments), then robots.txt defaults to:

User-agent: *
Disallow: /

If sitemap.enabled=true, then robots.txt defaults to:

User-agent: *
Disallow: /<metacat.application.context>/d1/mn/v2/packages/
Sitemap: /sitemap_index.xml

...but the values for User-agent: and Disallow: can be customized via values.yaml.
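
A minimal sketch of how that conditional could be expressed in a Helm template, assuming the robots.txt is rendered from a ConfigMap; the value names below are illustrative, not the chart's real keys.

# Hypothetical templates/robots-configmap.yaml; value names are illustrative.
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-robots-txt
data:
  robots.txt: |
    User-agent: {{ .Values.robots.userAgent | default "*" }}
    {{- if .Values.robots.sitemapEnabled }}
    Disallow: /{{ .Values.robots.applicationContext }}/d1/mn/v2/packages/
    Sitemap: /sitemap_index.xml
    {{- else }}
    Disallow: /
    {{- end }}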

artntek — Apr 24 '24 20:04

  • [X] test.arcticdata.io (verified robots.txt)
  • [X] demo.arcticdata.io (verified during yesterday's front-end call that it exists here; robots.txt)
  • [X] handy-owl.nceas.ucsb.edu (Rushi's personal test server; added robots.txt)
  • [X] dev.nceas.ucsb.edu (verified existing robots.txt)
  • [ ] ...

rushirajnenuji — Apr 30 '24 16:04