
Heritrix sometimes writes empty WARC records for redirects

Open anjackson opened this issue 7 years ago • 1 comments

Just noticed an oddity in our crawls. We have a WARC response record with no HTTP response in it (see below). This seems to be due to the crawler getting an HTTP 204 response.

However, I only think that because @ikreymer's pywb cdx-indexer creates this CDX line:

com,facebook)/plugins/like.php?action=like&colorscheme=light&height=21&href=http://newspig.co.uk/8-reasons-to-hold-cash-markets-are-rational-until-theyre-not/&layout=button_count&show_faces=false&width=105 20180422171119 http://www.facebook.com/plugins/like.php?href=http://newspig.co.uk/8-reasons-to-hold-cash-markets-are-rational-until-theyre-not/&layout=button_count&show_faces=false&width=105&action=like&colorscheme=light&height=21 unk 204 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 383 10514026 BL-20180422170134461-00018-63~ukwa-h3-pulse-daily~8443.warc.gz

But frankly I don't understand where it's getting the 204 from!
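
For reference, the CDX line above splits on spaces into eleven fields. The sketch below assumes the common 11-field CDX layout (URL key, timestamp, original URL, MIME type, status, digest, redirect, meta tags, record length, offset, filename). The field names and the `parse_cdx_line` helper are labels chosen here for illustration, not taken from pywb's source, and the example line uses a shortened URL in place of the real one.

```python
from collections import namedtuple

# Field names are illustrative labels for an assumed 11-field CDX layout.
CdxLine = namedtuple(
    "CdxLine",
    "urlkey timestamp url mime status digest redirect meta length offset filename",
)

def parse_cdx_line(line):
    """Split one space-separated CDX line into named fields."""
    parts = line.strip().split(" ")
    if len(parts) != len(CdxLine._fields):
        raise ValueError(
            f"expected {len(CdxLine._fields)} fields, got {len(parts)}"
        )
    return CdxLine(*parts)

# Shortened stand-in for the line above (same field order, shorter URL).
example = (
    "com,facebook)/plugins/like.php?action=like 20180422171119 "
    "http://www.facebook.com/plugins/like.php?action=like unk 204 "
    "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 383 10514026 "
    "BL-20180422170134461-00018-63~ukwa-h3-pulse-daily~8443.warc.gz"
)
rec = parse_cdx_line(example)
print(rec.mime, rec.status, rec.digest)
```

Read this way, the indexer is reporting a MIME type of unk and a status of 204 for the record; the sketch does not explain where the indexer derives that status from.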

Assuming it is really a 204 (I'll check the crawl log), the question is: What should Heritrix3 be writing to the WARC file?

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.facebook.com/plugins/like.php?href=http://newspig.co.uk/8-reasons-to-hold-cash-markets-are-rational-until-theyre-not/&layout=button_count&show_faces=false&width=105&action=like&colorscheme=light&height=21
WARC-Date: 2018-04-22T17:11:19Z
WARC-IP-Address: 157.240.1.35
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-Record-ID: <urn:uuid:bf7d95c5-0844-4778-9490-1af393b53204>
Content-Length: 0



WARC/1.0
WARC-Type: request
WARC-Target-URI: http://www.facebook.com/plugins/like.php?href=http://newspig.co.uk/8-reasons-to-hold-cash-markets-are-rational-until-theyre-not/&layout=button_count&show_faces=false&width=105&action=like&colorscheme=light&height=21
WARC-Date: 2018-04-22T17:11:19Z
WARC-Concurrent-To: <urn:uuid:bf7d95c5-0844-4778-9490-1af393b53204>
WARC-Record-ID: <urn:uuid:1fa4ddfb-2285-48b3-a835-61378b29a1d4>
Content-Length: 0



WARC/1.0
WARC-Type: metadata
WARC-Target-URI: http://www.facebook.com/plugins/like.php?href=http://newspig.co.uk/8-reasons-to-hold-cash-markets-are-rational-until-theyre-not/&layout=button_count&show_faces=false&width=105&action=like&colorscheme=light&height=21
WARC-Date: 2018-04-22T17:11:19Z
WARC-Concurrent-To: <urn:uuid:bf7d95c5-0844-4778-9490-1af393b53204>
WARC-Record-ID: <urn:uuid:dc148cb3-2c39-42c5-b1c6-02654fe428b7>
Content-Type: application/warc-fields
Content-Length: 564

via: http://newspig.co.uk/8-reasons-to-hold-cash-markets-are-rational-until-theyre-not/
hopsFromSeed: LLLE
sourceTag: http://newspig.co.uk/
fetchTimeMs: 12
charsetForLinkExtraction: ISO-8859-1
outlink: https://www.facebook.com/plugins/like.php?href=http%3A%2F%2Fnewspig.co.uk%2F8-reasons-to-hold-cash-markets-are-rational-until-theyre-not%2F&layout=button_count&show_faces=false&width=105&action=like&colorscheme=light&height=21 R Location:
outlink: http://www.facebook.com/favicon.ico I =INFERRED_MISC
outlink: http://www.facebook.com/ I =INFERRED_MISC
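
One thing that can be checked offline: the WARC-Payload-Digest in the response record above matches the Base32-encoded SHA-1 of zero bytes, which is consistent with the Content-Length: 0 on that record. A minimal sketch using only the Python standard library (the `warc_sha1_digest` helper name is made up for this example):

```python
import base64
import hashlib

def warc_sha1_digest(payload: bytes) -> str:
    """Return a 'sha1:<base32>' digest in the form used in the record above."""
    raw = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")

# Digest copied from the response record's WARC-Payload-Digest header.
recorded = "sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ"
empty_digest = warc_sha1_digest(b"")
print(empty_digest == recorded)
```

This only confirms that the digest was computed over an empty payload; it says nothing about what the server actually returned.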



anjackson, May 01 '18 09:05

Judging by the extracted links (the outlink tagged "R Location:" in the metadata record), it seems to be a redirect, not a 204.

ato, Aug 02 '18 12:08