Please specify your crawler User-agent for robots.txt
It does not appear to be documented, and your crawler is wasting significant bandwidth downloading large media files that contain no URL metadata.
Hi,
Do you have the exact requests from your access logs? That would definitely help resolve this faster.
Samples:
www.earth.org.uk:443 216.208.194.51 - - [15/Feb/2024:12:32:01 +0000] "GET /img/audio/AudioMoth/20210402-prewash/ZoomH1n-prewash-end.flac HTTP/2.0" 200 4712277 "-" "Mozilla/5.0 (compatible; Domains Project/1.3.7; +https://domainsproject.org)"
www.earth.org.uk:443 216.208.194.51 - - [15/Feb/2024:12:32:05 +0000] "GET /img/audio/AudioMoth/20210402-prewash/ZoomH1n-prewash-end.mp3 HTTP/2.0" 200 1254949 "-" "Mozilla/5.0 (compatible; Domains Project/1.3.7; +https://domainsproject.org)"
www.earth.org.uk:443 216.208.194.51 - - [15/Feb/2024:12:32:08 +0000] "GET /img/audio/AudioMoth/20210402-prewash/ZoomH1n-prewash-end.mp3L HTTP/2.0" 200 303699 "-" "Mozilla/5.0 (compatible; Domains Project/1.3.7; +https://domainsproject.org)"
www.earth.org.uk:443 216.208.194.51 - - [15/Feb/2024:12:32:13 +0000] "GET /img/audio/AudioMoth/20210402/20210402T1827Z-desk-ambient-AudioMoth-384ksps.flac HTTP/2.0" 200 18384032 "-" "Mozilla/5.0 (compatible; Domains Project/1.3.7; +https://domainsproject.org)"
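For anyone wanting to quantify the wasted bandwidth from their own logs, a quick sketch like the following can total the bytes served to this bot. It assumes the same vhost-prefixed combined log format as the samples above, where the response size is field 11; adjust the field number and log path for your setup.

```shell
# Write two sample log lines (taken from the samples above) to a scratch file.
cat > /tmp/sample_access.log <<'EOF'
www.earth.org.uk:443 216.208.194.51 - - [15/Feb/2024:12:32:01 +0000] "GET /img/audio/AudioMoth/20210402-prewash/ZoomH1n-prewash-end.flac HTTP/2.0" 200 4712277 "-" "Mozilla/5.0 (compatible; Domains Project/1.3.7; +https://domainsproject.org)"
www.earth.org.uk:443 216.208.194.51 - - [15/Feb/2024:12:32:05 +0000] "GET /img/audio/AudioMoth/20210402-prewash/ZoomH1n-prewash-end.mp3 HTTP/2.0" 200 1254949 "-" "Mozilla/5.0 (compatible; Domains Project/1.3.7; +https://domainsproject.org)"
EOF

# Select requests whose User-Agent names domainsproject.org, then sum the
# response-size field ($11 in this log format).
grep 'domainsproject.org' /tmp/sample_access.log | awk '{sum += $11} END {print sum}'
# prints 5967226
```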
Thank you so much for that; I'll have it fixed before resuming the crawling process.
Please also explicitly document which User-agent token you respond to in robots.txt!
Thanks for dealing with this quickly.
Rgds
Damon
Fixed + doc: https://github.com/tb0hdan/domains/blob/master/README.md#disabling-domains-project-bot-access-to-your-website
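Once a robots.txt rule is in place, it can be sanity-checked with Python's standard-library robots.txt parser. The `Domains Project` token and the `/img/` path below are illustrative assumptions; use whichever token the linked README actually documents.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the real user-agent token to match on
# is whatever the Domains Project README specifies.
robots_txt = """\
User-agent: Domains Project
Disallow: /img/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The bot should be barred from the media directory but not the site root.
print(rp.can_fetch("Domains Project", "/img/audio/test.flac"))  # False
print(rp.can_fetch("Domains Project", "/index.html"))           # True
```

Note that `can_fetch` matches the token against the product part of the User-Agent, so a rule keyed on the documented token applies regardless of the version suffix (e.g. `/1.3.7`) the crawler reports.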