Question re: cloudfront.net

Open carj opened this issue 3 years ago • 1 comments

I'm trying to crawl a company website using the seed www.example.com, but Heritrix either generates no WARC file or writes only empty WARCs. My seed list is a single line containing the company website.

If I do get something back in the WARC file, it looks like the following. Is there something I should add to the beans file to make the crawl work? I'm using the default beans file from the latest release.

Why does the crawl just return the DNS records?

Thanks for any assistance.

WARC/1.0
WARC-Type: response
WARC-Target-URI: dns:www.example.com
WARC-Date: 2022-05-27T16:38:57Z
WARC-IP-Address: 172.30.0.2
WARC-Record-ID: urn:uuid:e2196c71-7dec-4163-94d4-bb64934888a6
Content-Type: text/dns
Content-Length: 224

20220527163857
d1a4lrim8ynrpr.cloudfront.net. 60 IN A 13.249.39.29
d1a4lrim8ynrpr.cloudfront.net. 60 IN A 13.249.39.116
d1a4lrim8ynrpr.cloudfront.net. 60 IN A 13.249.39.81
d1a4lrim8ynrpr.cloudfront.net. 60 IN A 13.249.39.52

carj, Jun 01 '22 09:06

I think some more information is required. What does your crawl log look like when it's running? Could the site have a restrictive robots.txt? Can you connect with something like curl?

NGTmeaty, Jun 06 '22 04:06
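
One of the checks suggested above, whether a restrictive robots.txt is blocking the crawl, can be sketched with Python's standard library. This is only a rough diagnostic, not part of Heritrix: the helper names `fetch_robots_txt` and `is_allowed` are hypothetical, and the user-agent token `heritrix` is an assumption, so substitute whatever your crawl configuration actually sends.

```python
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser


def fetch_robots_txt(site_root: str, timeout: float = 10.0) -> str:
    """Download robots.txt from a site root such as https://www.example.com.

    Hypothetical helper. urlopen raises HTTPError on a 404 and URLError on
    connection problems, which is itself a useful hint about reachability.
    """
    url = site_root.rstrip("/") + "/robots.txt"
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Assumed values: replace with the real seed and the crawler's agent token.
    seed = "https://www.example.com"
    rules = fetch_robots_txt(seed)
    print(is_allowed(rules, "heritrix", seed + "/"))
```

If this prints `False` for the seed URL, the crawler would be expected to fetch only prerequisites such as the DNS record and robots.txt, which would match a WARC that contains nothing but `dns:` records. If it prints `True`, the crawl log is the next place to look.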