
Keep well-behaved robots away

Open · whatnick opened this issue on Mar 2, 2023 · 4 comments

Supply robots meta tags on HTML pages, a top-level robots.txt, and X-Robots-Tag HTTP headers on STAC responses to prevent excessive crawling and the associated DB load.

Guidance for implementing these recommendations: https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#xrobotstag

whatnick commented on Mar 2, 2023
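A minimal sketch of the HTTP-header part of this proposal, assuming a Flask app (Explorer is built on Flask); the /stac path prefix and the hook below are illustrative assumptions, not Explorer's actual wiring. HTML pages can instead carry an equivalent <meta name="robots" content="noindex, nofollow"> tag in their templates.

# Hedged sketch: attach an X-Robots-Tag header to responses that should stay
# out of search indexes (e.g. STAC JSON, which cannot carry a robots meta tag).
# The "/stac" prefix is an assumption for illustration only.
from flask import Flask, request

app = Flask(__name__)

@app.after_request
def add_robots_header(response):
    if request.path.startswith("/stac"):
        response.headers["X-Robots-Tag"] = "noindex, nofollow"
    return response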

The question is whether you want to block all (well-behaved) bots at all levels. The problem I have seen comes in when the bots start hitting the individual day pages. If you want the top-level products to be discoverable via search engines, then perhaps this is only needed at certain levels. Perhaps the first level below /products/ could be allowed and everything deeper disallowed.

JonDHo commented on Mar 2, 2023

I have just finished implementing a robots.txt (by adding a suitable ingress with a fixed response) and can confirm that keeping the bots out has resulted in a significant reduction in DB load (using AWS RDS Serverless).

The image below shows the effect on the backend DB load of adding the robots.txt to Explorer. DB usage drops from around 1.5-2 Aurora capacity units (ACU) to just above 0.5, which is the configured minimum for this DB.

[Figure: backend DB load before and after adding robots.txt, dropping from about 1.5-2 ACU to just above 0.5]

JonDHo commented on Mar 3, 2023

> If you want the top-level products to be discoverable via search engines, then perhaps this is only needed at certain levels. Perhaps the first level below /products/ could be allowed and everything deeper disallowed.

This sounds like a sensible compromise to me. Is this what you implemented?

> I have just finished implementing a robots.txt (by adding a suitable ingress with a fixed response) and can confirm that keeping the bots out has resulted in a significant reduction in DB load (using AWS RDS Serverless).

Could you share a copy of the robots.txt you created? Even if we don't put it into the Explorer code, it would be great to have as documentation.

@JonDHo

omad commented on Mar 9, 2023

I am currently using the example below. It permits access to all general pages, including the top-level product pages, but none of the year, month, day, or dataset pages:

User-Agent: *
Allow: /
Disallow: /products/*/*

See: https://explorer.datacubechile.cl/robots.txt

JonDHo commented on Mar 16, 2023
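As an aside, and not what was deployed above (which returned robots.txt from the ingress itself): a deployment without a configurable ingress could serve the same fixed content from the Flask app directly, roughly as sketched below.

# Hedged sketch: serve a fixed robots.txt from the application instead of the
# ingress. The content mirrors the example above; adjust the rules to your needs.
from flask import Flask, Response

app = Flask(__name__)

ROBOTS_TXT = """User-Agent: *
Allow: /
Disallow: /products/*/*
"""

@app.route("/robots.txt")
def robots_txt():
    return Response(ROBOTS_TXT, mimetype="text/plain")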

One additional comment on this after the latest PR. I would also recommend adding:

Disallow: /dataset/*

to the default robots.txt. Bots hitting each individual dataset have been a big problem for me; the /dataset/* pages redirect to /products/, but they are still valid URLs. I have now gone to the extent of disallowing everything, as there isn't much benefit in having even the product pages discoverable via search engines. I would rather have my project website, not Explorer, be the entry point for users searching the web.

JonDHo commented on Jun 18, 2024
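Pulling the thread's suggestions together, one illustrative combined policy (not a shipped default) keeps the top-level pages crawlable while blocking the per-date and per-dataset pages; the fully closed variant mentioned above is simply "User-agent: *" followed by "Disallow: /". As a drop-in for the constant in the earlier sketch:

# Illustrative combined policy based on the recommendations in this thread;
# not an official default. Swap the body for "Disallow: /" to block everything.
ROBOTS_TXT = """User-Agent: *
Allow: /
Disallow: /products/*/*
Disallow: /dataset/*
"""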