# Default robots.txt to control harvesting

**Describe the feature you'd like**
Add a robots.txt file that is easily configured for production and test deployments.
For production, the file should generally restrict access to the package service, but allow everything else, and provide the sitemap link:
```
User-agent: *
Disallow: /metacat/d1/mn/v2/packages/
Sitemap: https://arcticdata.io/sitemap_index.xml
```
For testing, the file should restrict access to everything:
```
User-agent: *
Disallow: /
```
**Is your feature request related to a problem? Please describe.**

Duplicated test datasets show up in Google Dataset Search and flood the results, making it hard to find the real production datasets.

MetacatUI provides a searchable web interface which, in combination with a Metacat-provided sitemap.xml document, enables harvesters like Googlebot to index the site and all of its datasets. We generally do not want those harvesters to index test deployments, since their content is typically bogus. An example can be seen here:
- https://datasetsearch.research.google.com/search?src=0&query=site%3Aarcticdata.io
**Considerations**

- For some deployments, MetacatUI is not installed at the root of a web site, and so would not control the robots.txt deployment. For example, for ADC, MetacatUI is installed at https://arcticdata.io/catalog, so the robots.txt needs to go at the site root.
- Containerized Kubernetes deployments could probably be configured through Helm or other mechanisms to do the right thing conditionally, and allow for site overrides via values.yaml.
Manually deployed some robots.txt files, tracking in our deployments list here: https://docs.google.com/spreadsheets/d/1NtF8DAZCg6eGKGY66ca2nYi2ftmVVGR3-0IHlAZKZCI/edit#gid=0
Added to the Metacat Helm chart in PR 1893.
This adds a robots.txt to the metacat installation, and adds a rewrite rule to redirect /robots.txt to its location.
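As a sketch of what such a rewrite rule could look like in a Kubernetes deployment (the service name, port, and target path below are illustrative assumptions, not copied from PR 1893), an NGINX ingress can map the root-level path onto the file served from the webapp context:

```yaml
# Hypothetical sketch only: route /robots.txt to the Metacat webapp's copy.
# Service name, port, and rewrite target are assumptions for illustration.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: metacat-robots
  annotations:
    # Rewrite the matched path to where the chart places robots.txt
    nginx.ingress.kubernetes.io/rewrite-target: /metacat/robots.txt
spec:
  ingressClassName: nginx
  rules:
    - http:
        paths:
          - path: /robots.txt
            pathType: Exact
            backend:
              service:
                name: metacat
                port:
                  number: 8080
```

The key point is that the rewrite happens at the ingress layer, so crawlers always see robots.txt at the site root regardless of where the webapp is mounted.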
If the Metacat property `sitemap.enabled=false` (the default setting for k8s Metacat deployments), then robots.txt defaults to:

```
User-agent: *
Disallow: /
```
If `sitemap.enabled=true`, then robots.txt defaults to:

```
User-agent: *
Disallow: /<metacat.application.context>/d1/mn/v2/packages/
Sitemap: /sitemap_index.xml
```
...but the values for `User-agent:` and `Disallow:` can be customized via values.yaml.
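A hedged sketch of what such an override might look like in a site's values.yaml (the key names below are illustrative assumptions; check the chart's documented values for the real ones):

```yaml
# Hypothetical values.yaml override: key names are assumptions, not the
# chart's actual schema.
robots:
  userAgent: "*"
  disallow:
    - /metacat/d1/mn/v2/packages/
```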
- [X] test.arcticdata.io (verified robots.txt)
- [X] demo.arcticdata.io (verified during the front-end call yesterday that it exists; robots.txt)
- [X] handy-owl.nceas.ucsb.edu (Rushi's personal test server; added robots.txt)
- [X] dev.nceas.ucsb.edu (verified existing robots.txt)
- [ ] ...
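Verification of deployments like those above can be partly automated. As a sketch (the host list and function names are my own illustrations, not part of the tracking spreadsheet), a small script can fetch each site's /robots.txt and check whether crawling is fully disallowed:

```python
"""Sketch: check whether a robots.txt fully disallows crawling.

Host names and helper names here are illustrative assumptions; they are
not taken from the deployments spreadsheet referenced above.
"""
from urllib.request import urlopen


def disallows_everything(robots_text: str) -> bool:
    """Return True if a 'User-agent: *' group contains 'Disallow: /'."""
    in_wildcard_group = False
    for raw in robots_text.splitlines():
        # Strip comments and whitespace; skip lines without a field name.
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            in_wildcard_group = value == "*"
        elif field == "disallow" and in_wildcard_group and value == "/":
            return True
    return False


def check_host(host: str) -> bool:
    """Fetch https://<host>/robots.txt and test it (requires network)."""
    with urlopen(f"https://{host}/robots.txt") as resp:
        return disallows_everything(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Hosts listed for illustration; extend with the tracked deployments.
    for host in ["test.arcticdata.io", "demo.arcticdata.io"]:
        print(host, "fully disallowed:", check_host(host))
```

This only checks the blanket `Disallow: /` case for the wildcard agent; a production robots.txt with per-path rules would report False, which matches the intent of the test-vs-production split above.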