Please specify your crawler User-agent for robots.txt
It does not appear to be documented, and your crawler is wasting significant bandwidth downloading large media files that contain no URL metadata.
Hi,
Do you have the exact requests from your access logs? That would definitely help resolve this faster.
Samples:
www.earth.org.uk:443 216.208.194.51 - - [15/Feb/2024:12:32:01 +0000] "GET /img/audio/AudioMoth/20210402-prewash/ZoomH1n-prewash-end.flac HTTP/2.0" 200 4712277 "-" "Mozilla/5.0 (compatible; Domains Project/1.3.7; +https://domainsproject.org)"
www.earth.org.uk:443 216.208.194.51 - - [15/Feb/2024:12:32:05 +0000] "GET /img/audio/AudioMoth/20210402-prewash/ZoomH1n-prewash-end.mp3 HTTP/2.0" 200 1254949 "-" "Mozilla/5.0 (compatible; Domains Project/1.3.7; +https://domainsproject.org)"
www.earth.org.uk:443 216.208.194.51 - - [15/Feb/2024:12:32:08 +0000] "GET /img/audio/AudioMoth/20210402-prewash/ZoomH1n-prewash-end.mp3L HTTP/2.0" 200 303699 "-" "Mozilla/5.0 (compatible; Domains Project/1.3.7; +https://domainsproject.org)"
www.earth.org.uk:443 216.208.194.51 - - [15/Feb/2024:12:32:13 +0000] "GET /img/audio/AudioMoth/20210402/20210402T1827Z-desk-ambient-AudioMoth-384ksps.flac HTTP/2.0" 200 18384032 "-" "Mozilla/5.0 (compatible; Domains Project/1.3.7; +https://domainsproject.org)"
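For anyone wanting to quantify the wasted bandwidth from their own logs, a quick sketch like the following can total the bytes served to this bot. It assumes the same vhost-prefixed combined log format as the samples above, where the response size is field 11; adjust the field number and log path for your setup.

```shell
# Write two sample log lines (taken from the samples above) to a scratch file.
cat > /tmp/sample_access.log <<'EOF'
www.earth.org.uk:443 216.208.194.51 - - [15/Feb/2024:12:32:01 +0000] "GET /img/audio/AudioMoth/20210402-prewash/ZoomH1n-prewash-end.flac HTTP/2.0" 200 4712277 "-" "Mozilla/5.0 (compatible; Domains Project/1.3.7; +https://domainsproject.org)"
www.earth.org.uk:443 216.208.194.51 - - [15/Feb/2024:12:32:05 +0000] "GET /img/audio/AudioMoth/20210402-prewash/ZoomH1n-prewash-end.mp3 HTTP/2.0" 200 1254949 "-" "Mozilla/5.0 (compatible; Domains Project/1.3.7; +https://domainsproject.org)"
EOF

# Select requests whose User-Agent names domainsproject.org, then sum the
# response-size field ($11 in this log format).
grep 'domainsproject.org' /tmp/sample_access.log | awk '{sum += $11} END {print sum}'
# prints 5967226
```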
Thank you so much for that; I'll have it fixed before resuming the crawling process.
Please also explicitly document which User-agent token you respond to in robots.txt!
Thanks for dealing with this quickly.
Rgds
Damon
Fixed + doc: https://github.com/tb0hdan/domains/blob/master/README.md#disabling-domains-project-bot-access-to-your-website
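Once a robots.txt rule is in place, it can be sanity-checked with Python's standard-library robots.txt parser. The `Domains Project` token and the `/img/` path below are illustrative assumptions; use whichever token the linked README actually documents.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the real user-agent token to match on
# is whatever the Domains Project README specifies.
robots_txt = """\
User-agent: Domains Project
Disallow: /img/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The bot should be barred from the media directory but not the site root.
print(rp.can_fetch("Domains Project", "/img/audio/test.flac"))  # False
print(rp.can_fetch("Domains Project", "/index.html"))           # True
```

Note that `can_fetch` matches the token against the product part of the User-Agent, so a rule keyed on the documented token applies regardless of the version suffix (e.g. `/1.3.7`) the crawler reports.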