
Please specify your crawler User-agent for robots.txt

Open DamonHD opened this issue 2 years ago • 4 comments

It does not appear to be documented, and your crawler is wasting lots of bandwidth downloading lots of media files that do not contain any URL metadata.

DamonHD avatar Feb 15 '24 11:02 DamonHD

Hi,

Do you have the exact requests from your access logs? That would definitely help resolve this faster.

tb0hdan avatar Feb 15 '24 15:02 tb0hdan

Samples:

www.earth.org.uk:443 216.208.194.51 - - [15/Feb/2024:12:32:01 +0000] "GET /img/audio/AudioMoth/20210402-prewash/ZoomH1n-prewash-end.flac HTTP/2.0" 200 4712277 "-" "Mozilla/5.0 (compatible; Domains Project/1.3.7; +https://domainsproject.org)"
www.earth.org.uk:443 216.208.194.51 - - [15/Feb/2024:12:32:05 +0000] "GET /img/audio/AudioMoth/20210402-prewash/ZoomH1n-prewash-end.mp3 HTTP/2.0" 200 1254949 "-" "Mozilla/5.0 (compatible; Domains Project/1.3.7; +https://domainsproject.org)"
www.earth.org.uk:443 216.208.194.51 - - [15/Feb/2024:12:32:08 +0000] "GET /img/audio/AudioMoth/20210402-prewash/ZoomH1n-prewash-end.mp3L HTTP/2.0" 200 303699 "-" "Mozilla/5.0 (compatible; Domains Project/1.3.7; +https://domainsproject.org)"
www.earth.org.uk:443 216.208.194.51 - - [15/Feb/2024:12:32:13 +0000] "GET /img/audio/AudioMoth/20210402/20210402T1827Z-desk-ambient-AudioMoth-384ksps.flac HTTP/2.0" 200 18384032 "-" "Mozilla/5.0 (compatible; Domains Project/1.3.7; +https://domainsproject.org)"

DamonHD avatar Feb 15 '24 16:02 DamonHD

Thank you so much for that. I'll have it fixed before resuming the crawling process.

tb0hdan avatar Feb 15 '24 16:02 tb0hdan

Do also please explicitly document which User-agent you respond to in robots.txt!

Thanks for dealing with this quickly.

Rgds

Damon

DamonHD avatar Feb 15 '24 16:02 DamonHD

Fixed + doc: https://github.com/tb0hdan/domains/blob/master/README.md#disabling-domains-project-bot-access-to-your-website

tb0hdan avatar Feb 19 '24 01:02 tb0hdan
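
For anyone who wants to measure how much bandwidth a crawler like this is using on their own site, here is a minimal sketch that totals response bytes per User-agent from access log lines in the format of the samples in this thread. The regular expression, the `bytes_by_agent` helper, and the `SAMPLES` list are illustrative assumptions, not part of the Domains Project or of any server tooling; adjust the pattern if your log format differs.

```python
import re
from collections import defaultdict

# Assumed layout, matching the samples quoted in this thread:
# vhost:port client ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<vhost>\S+) (?P<client>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def bytes_by_agent(lines):
    """Return a dict mapping each User-agent string to total response bytes."""
    totals = defaultdict(int)
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed format
        size = match.group("size")
        totals[match.group("agent")] += 0 if size == "-" else int(size)
    return dict(totals)


AGENT = "Mozilla/5.0 (compatible; Domains Project/1.3.7; +https://domainsproject.org)"
PREFIX = "www.earth.org.uk:443 216.208.194.51 - - "

# The four sample requests from this thread, rebuilt from their parts.
SAMPLES = [
    PREFIX + '[15/Feb/2024:12:32:01 +0000] "GET /img/audio/AudioMoth/20210402-prewash/'
    'ZoomH1n-prewash-end.flac HTTP/2.0" 200 4712277 "-" "' + AGENT + '"',
    PREFIX + '[15/Feb/2024:12:32:05 +0000] "GET /img/audio/AudioMoth/20210402-prewash/'
    'ZoomH1n-prewash-end.mp3 HTTP/2.0" 200 1254949 "-" "' + AGENT + '"',
    PREFIX + '[15/Feb/2024:12:32:08 +0000] "GET /img/audio/AudioMoth/20210402-prewash/'
    'ZoomH1n-prewash-end.mp3L HTTP/2.0" 200 303699 "-" "' + AGENT + '"',
    PREFIX + '[15/Feb/2024:12:32:13 +0000] "GET /img/audio/AudioMoth/20210402/'
    '20210402T1827Z-desk-ambient-AudioMoth-384ksps.flac HTTP/2.0" 200 18384032 "-" "'
    + AGENT + '"',
]

if __name__ == "__main__":
    for agent, total in bytes_by_agent(SAMPLES).items():
        print(f"{total:>12} bytes  {agent}")
```

To run it against a real log, pass an open file object to `bytes_by_agent` instead of `SAMPLES`; lines that do not match the assumed format are skipped rather than raising an error.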