warcit
URLs of file names containing # are not escaped correctly
This possibly hints at other escaping issues.
Example:
WARC/1.0
WARC-Date: 2004-11-10T16:15:13Z
WARC-Source-URI: file://waste/images/17#.jpg
WARC-Created-Date: 2018-02-06T16:26:13Z
WARC-Type: resource
WARC-Record-ID: <urn:uuid:73015799-0b5a-11e8-9ac5-5ce0c57ec2e1>
WARC-Target-URI: http://heise.de/tp/kunst/waste/images/17#.jpg
WARC-Payload-Digest: sha1:GLC3CKKQ4LSVN4FD75TBXBOOAHA6WP6N
WARC-Block-Digest: sha1:GLC3CKKQ4LSVN4FD75TBXBOOAHA6WP6N
Content-Type: image/jpeg
Content-Length: 5222
Should be:
WARC/1.0
WARC-Date: 2004-11-10T16:15:13Z
WARC-Source-URI: file://waste/images/17#.jpg
WARC-Created-Date: 2018-02-06T16:26:13Z
WARC-Type: resource
WARC-Record-ID: <urn:uuid:73015799-0b5a-11e8-9ac5-5ce0c57ec2e1>
WARC-Target-URI: http://heise.de/tp/kunst/waste/images/17%23.jpg
WARC-Payload-Digest: sha1:GLC3CKKQ4LSVN4FD75TBXBOOAHA6WP6N
WARC-Block-Digest: sha1:GLC3CKKQ4LSVN4FD75TBXBOOAHA6WP6N
Content-Type: image/jpeg
Content-Length: 5222
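The expected escaping can be sketched in Python with the standard library's `urllib.parse.quote`. This is only an illustration under assumptions, not warcit's actual code or the change made in the fix: the helper name `file_path_to_target_uri` and its parameters are hypothetical.

```python
from urllib.parse import quote


def file_path_to_target_uri(prefix, rel_path):
    """Join a URL prefix and a relative file path, percent-encoding the path.

    Hypothetical helper for illustration; not part of warcit's API.
    """
    # quote() keeps "/" unescaped by default but encodes characters such as
    # "#", "?" and spaces, which would otherwise change the meaning of the URL.
    return prefix.rstrip("/") + "/" + quote(rel_path.lstrip("/"))


print(file_path_to_target_uri("http://heise.de/tp/kunst/", "waste/images/17#.jpg"))
# http://heise.de/tp/kunst/waste/images/17%23.jpg
```

Without the encoding step, everything after `#` would be read as a URL fragment, so the target URI would effectively point at `.../images/17` rather than at the file itself.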
Fixed by https://github.com/webrecorder/warcit/pull/2