heritrix3
Heritrix sometimes writes empty WARC records for redirects
Just noticed an oddity in our crawls. We have a WARC response record with no HTTP response in it (see below). This seems to be due to the crawler getting an HTTP 204 response.
However, I only think that because @ikreymer's pywb cdx-indexer creates this CDX line:
com,facebook)/plugins/like.php?action=like&colorscheme=light&height=21&href=http://newspig.co.uk/8-reasons-to-hold-cash-markets-are-rational-until-theyre-not/&layout=button_count&show_faces=false&width=105 20180422171119 http://www.facebook.com/plugins/like.php?href=http://newspig.co.uk/8-reasons-to-hold-cash-markets-are-rational-until-theyre-not/&layout=button_count&show_faces=false&width=105&action=like&colorscheme=light&height=21 unk 204 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 383 10514026 BL-20180422170134461-00018-63~ukwa-h3-pulse-daily~8443.warc.gz
But frankly I don't understand where it's getting the 204 from!
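For anyone reading the CDX line by eye, here is a minimal sketch that splits it into named fields. It assumes the indexer emitted the common 11-column CDX layout; the field names and the `parse_cdx_line` helper are my own labels, not part of pywb's API.

```python
# Assumed column order for an 11-field CDX line (not confirmed against pywb itself).
CDX11_FIELDS = [
    "urlkey", "timestamp", "original", "mimetype", "statuscode",
    "digest", "redirect", "metatags", "length", "offset", "filename",
]

# The CDX line from this report, copied verbatim.
SAMPLE = (
    "com,facebook)/plugins/like.php?action=like&colorscheme=light&height=21"
    "&href=http://newspig.co.uk/8-reasons-to-hold-cash-markets-are-rational-until-theyre-not/"
    "&layout=button_count&show_faces=false&width=105 20180422171119 "
    "http://www.facebook.com/plugins/like.php?href=http://newspig.co.uk/"
    "8-reasons-to-hold-cash-markets-are-rational-until-theyre-not/"
    "&layout=button_count&show_faces=false&width=105&action=like&colorscheme=light&height=21 "
    "unk 204 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 383 10514026 "
    "BL-20180422170134461-00018-63~ukwa-h3-pulse-daily~8443.warc.gz"
)


def parse_cdx_line(line):
    """Split one space-separated CDX line into a dict keyed by field name."""
    parts = line.strip().split(" ")
    if len(parts) != len(CDX11_FIELDS):
        raise ValueError(f"expected {len(CDX11_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CDX11_FIELDS, parts))


if __name__ == "__main__":
    record = parse_cdx_line(SAMPLE)
    # Print only the fields relevant to this report.
    for name in ("mimetype", "statuscode", "length", "offset"):
        print(name, "=", record[name])
```

Under that assumed layout, the status column is `204` and the MIME type column is `unk`, which matches a record in which the indexer found no HTTP headers to read.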
Assuming it is really a 204 (I'll check the crawl log), the question is: What should Heritrix3 be writing to the WARC file?
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.facebook.com/plugins/like.php?href=http://newspig.co.uk/8-reasons-to-hold-cash-markets-are-rational-until-theyre-not/&layout=button_count&show_faces=false&width=105&action=like&colorscheme=light&height=21
WARC-Date: 2018-04-22T17:11:19Z
WARC-IP-Address: 157.240.1.35
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-Record-ID: <urn:uuid:bf7d95c5-0844-4778-9490-1af393b53204>
Content-Length: 0
WARC/1.0
WARC-Type: request
WARC-Target-URI: http://www.facebook.com/plugins/like.php?href=http://newspig.co.uk/8-reasons-to-hold-cash-markets-are-rational-until-theyre-not/&layout=button_count&show_faces=false&width=105&action=like&colorscheme=light&height=21
WARC-Date: 2018-04-22T17:11:19Z
WARC-Concurrent-To: <urn:uuid:bf7d95c5-0844-4778-9490-1af393b53204>
WARC-Record-ID: <urn:uuid:1fa4ddfb-2285-48b3-a835-61378b29a1d4>
Content-Length: 0
WARC/1.0
WARC-Type: metadata
WARC-Target-URI: http://www.facebook.com/plugins/like.php?href=http://newspig.co.uk/8-reasons-to-hold-cash-markets-are-rational-until-theyre-not/&layout=button_count&show_faces=false&width=105&action=like&colorscheme=light&height=21
WARC-Date: 2018-04-22T17:11:19Z
WARC-Concurrent-To: <urn:uuid:bf7d95c5-0844-4778-9490-1af393b53204>
WARC-Record-ID: <urn:uuid:dc148cb3-2c39-42c5-b1c6-02654fe428b7>
Content-Type: application/warc-fields
Content-Length: 564
via: http://newspig.co.uk/8-reasons-to-hold-cash-markets-are-rational-until-theyre-not/
hopsFromSeed: LLLE
sourceTag: http://newspig.co.uk/
fetchTimeMs: 12
charsetForLinkExtraction: ISO-8859-1
outlink: https://www.facebook.com/plugins/like.php?href=http%3A%2F%2Fnewspig.co.uk%2F8-reasons-to-hold-cash-markets-are-rational-until-theyre-not%2F&layout=button_count&show_faces=false&width=105&action=like&colorscheme=light&height=21 R Location:
outlink: http://www.facebook.com/favicon.ico I =INFERRED_MISC
outlink: http://www.facebook.com/ I =INFERRED_MISC
From the extracted links (the `R Location:` outlink in the metadata record), it looks like a redirect rather than a 204.
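To check other records the same way, a rough sketch like the following pulls the redirect targets out of a metadata record body. It assumes each `outlink:` line is whitespace-separated as URI, hop type, then context, and that hop type `R` marks a redirect; `parse_outlinks` and `redirect_targets` are hypothetical helper names, not Heritrix code.

```python
# The three outlink lines from the metadata record in this report, copied verbatim.
SAMPLE_METADATA = """\
outlink: https://www.facebook.com/plugins/like.php?href=http%3A%2F%2Fnewspig.co.uk%2F8-reasons-to-hold-cash-markets-are-rational-until-theyre-not%2F&layout=button_count&show_faces=false&width=105&action=like&colorscheme=light&height=21 R Location:
outlink: http://www.facebook.com/favicon.ico I =INFERRED_MISC
outlink: http://www.facebook.com/ I =INFERRED_MISC
"""


def parse_outlinks(metadata_block):
    """Return (uri, hop, context) tuples for every 'outlink:' line."""
    links = []
    for line in metadata_block.splitlines():
        if not line.startswith("outlink:"):
            continue
        fields = line[len("outlink:"):].split()
        if len(fields) < 2:
            continue  # malformed line: no hop type present
        uri, hop = fields[0], fields[1]
        context = " ".join(fields[2:])
        links.append((uri, hop, context))
    return links


def redirect_targets(metadata_block):
    """Return the URIs of outlinks whose hop type is 'R' (assumed: redirect)."""
    return [uri for uri, hop, _ in parse_outlinks(metadata_block) if hop == "R"]


if __name__ == "__main__":
    for target in redirect_targets(SAMPLE_METADATA):
        print(target)
```

On the record above this yields a single target, the `https://` version of the same like.php URL.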