Crawls of mkdc only return DNS record in WARC

Open machawk1 opened this issue 4 years ago • 2 comments

Tested in both the basic and advanced interfaces by crawling https://matkelly.com and the default https://matkelly.com/wail. In both cases, the resulting WARC contains only the DNS record.

Other URIs seem to produce the correct results.
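One way to confirm the symptom without opening the WARC by hand is to tally the records it contains. The following is a minimal sketch using only the Python standard library, not WAIL's or Heritrix's own tooling. It assumes that DNS lookups appear as records whose `WARC-Target-URI` starts with `dns:`; the function names `iter_warc_records`, `count_records`, and `summarize_warc` are hypothetical.

```python
import gzip
from collections import Counter


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # end of this record's header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Read the payload by length so blank lines inside it are not misparsed.
        payload = stream.read(int(headers.get("content-length", 0)))
        yield headers, payload


def count_records(stream):
    """Count records by (WARC-Type, URI scheme of WARC-Target-URI)."""
    counts = Counter()
    for headers, _ in iter_warc_records(stream):
        uri = headers.get("warc-target-uri", "")
        scheme = uri.split(":", 1)[0] if ":" in uri else "(none)"
        counts[(headers.get("warc-type", "?"), scheme)] += 1
    return counts


def summarize_warc(path):
    """Open a .warc or .warc.gz file and return its record counts."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as f:
        return count_records(f)
```

If the tally for an affected crawl shows no entries with an `http` or `https` scheme, only `dns` (plus any record without a target URI), that matches the behavior described above.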

machawk1, Mar 21 '20 01:03

Promoting this issue via pinning to give it priority.

Received a report from Wyeth Lynch, who was trying to capture https://www.sdstate.edu/covid-19 with WAIL 2019.05.21. I replicated this in the latest master and only saw a DNS record captured.

Need to recheck the generated Heritrix configuration to see why this is occurring.

Also, this UI/UX needs to be refined to make it clear to users that the crawl does not complete immediately, e.g., by giving direct access to the crawl status via a link or a button.

machawk1, May 14 '20 14:05

This might be related to whether the startup script includes the correct Heritrix libraries, per http://web.archive.org/web/20110928012834/http://tech.groups.yahoo.com/group/archive-crawler/message/772 .

The newer releases of Heritrix, when installed in WAIL, do not seem to exhibit the problem. A next step might be to diff the startup scripts.
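A rough sketch of that diff, using Python's `difflib` and keeping only changed lines that look related to libraries. The helper names `diff_scripts` and `library_lines` are hypothetical, and the assumption that the relevant lines mention `CLASSPATH` or a `.jar` file is a guess about the scripts, not something confirmed here.

```python
import difflib
import sys
from pathlib import Path


def diff_scripts(old_text, new_text, old_name="old", new_name="new"):
    """Return a unified diff of two script bodies as a list of lines."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_name,
            tofile=new_name,
            lineterm="",
        )
    )


def library_lines(diff):
    """Keep added/removed lines that mention a classpath or a .jar (assumed markers)."""
    kept = []
    for line in diff:
        if line.startswith(("+++", "---")):
            continue  # file header lines, not content changes
        if line.startswith(("+", "-")) and (
            "CLASSPATH" in line.upper() or ".jar" in line
        ):
            kept.append(line)
    return kept


if __name__ == "__main__":
    # Usage: python diff_startup.py <old startup script> <new startup script>
    old_path, new_path = sys.argv[1], sys.argv[2]
    diff = diff_scripts(
        Path(old_path).read_text(), Path(new_path).read_text(), old_path, new_path
    )
    print("\n".join(library_lines(diff)))
```

The filter is only a first pass; reading the full unified diff is still worthwhile if it turns up nothing.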

machawk1, Jul 22 '21 21:07