Greg Lindahl
I'm planning on writing a jumbo article on what "warcio test" finds, and this will end up being a part of that, if you don't beat me to it. I've...
That's an easy problem to see because many http clients will undo chunked encoding and decompress by default. So you can't depend on the checksum not changing. And even if...
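To make the point concrete, here is a minimal sketch (standard library only, not warcio code) showing that a digest over the on-the-wire chunked bytes differs from a digest over the de-chunked payload. The helper names `encode_chunked` and `decode_chunked` are made up for illustration, and the decoder ignores trailers and malformed input.

```python
import hashlib

def encode_chunked(payload: bytes, chunk_size: int = 8) -> bytes:
    """Wrap payload in HTTP/1.1 chunked transfer encoding (hypothetical helper)."""
    out = bytearray()
    for start in range(0, len(payload), chunk_size):
        chunk = payload[start:start + chunk_size]
        # Each chunk: hex length, CRLF, data, CRLF.
        out += b"%x\r\n" % len(chunk) + chunk + b"\r\n"
    # A zero-length chunk marks the end of the body.
    out += b"0\r\n\r\n"
    return bytes(out)

def decode_chunked(body: bytes) -> bytes:
    """Undo chunked encoding; ignores trailers (hypothetical helper)."""
    out = bytearray()
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        # Drop any chunk extension after ';' before parsing the hex size.
        size = int(body[pos:line_end].split(b";", 1)[0], 16)
        pos = line_end + 2
        if size == 0:
            break
        out += body[pos:pos + size]
        pos += size + 2  # skip the data and its trailing CRLF
    return bytes(out)

payload = b"Example Domain: this is a test payload."
on_the_wire = encode_chunked(payload)

wire_digest = hashlib.sha1(on_the_wire).hexdigest()
decoded_digest = hashlib.sha1(decode_chunked(on_the_wire)).hexdigest()

print("digest of chunked bytes:   ", wire_digest)
print("digest of de-chunked bytes:", decoded_digest)
```

A client that silently de-chunks before writing the record would therefore store a payload whose digest no longer matches one computed over the original transfer.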
I thought my latest pullreq (which got pulled into the devel branch) fixed that bug, and whatever I have installed locally sees digest pass. tests/data/example-iana.org-chunked.warc works fine for me, too,...
(Your unchunking code wasn't reading the last couple of bytes... ;-) )
Whoops, I fixed the bug in my develop-warcio-test branch, which is still a work in progress: ``` if not chunk_size: # chunk_size 0 indicates end of file + final_data =...
For the examples I mentioned at the top, one common tool was different and it was still considered a bug by IIPC... wget started putting <> around the URI in WARC-Target-URI,...
I think the correct place is the IIPC issue tracker, examples: https://github.com/iipc/warc-specifications/issues/49 https://github.com/iipc/warc-specifications/issues/33 https://github.com/iipc/warc-specifications/issues/48 ... ideally one of us interested people would do a bit of a survey to see...
I actually made a mistake in "warcio check" with the streaming interface! Triggering a digest check requires reading all of the payload, and I did that with ``` record.content_stream().read() ```...
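The comment above is cut off, so the following is not the fix it goes on to describe. It is only a generic sketch of the underlying idea: a digest can be confirmed only after every payload byte has been consumed, and reading in fixed-size blocks keeps memory bounded. `drain_and_digest` is a hypothetical helper, and `io.BytesIO` stands in for a record's content stream.

```python
import hashlib
import io

def drain_and_digest(stream, block_size=16384):
    """Read a file-like stream to EOF in blocks; return (sha1 hex, byte count).

    Hypothetical helper for illustration, not part of warcio.
    """
    digest = hashlib.sha1()
    total = 0
    while True:
        block = stream.read(block_size)
        if not block:  # empty bytes means end of stream
            break
        digest.update(block)
        total += len(block)
    return digest.hexdigest(), total

# Stand-in for a record payload; a real caller would pass the record's stream.
payload = b"x" * 50000
hexdigest, nbytes = drain_and_digest(io.BytesIO(payload), block_size=1024)
print(nbytes, hexdigest)
```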
@dlazesz your solution does not work in a streaming environment. The solution at the top works for both streaming and non-streaming. I think your solution also breaks the checksum code....
Work in progress -- now a pullreq https://github.com/webrecorder/warcio/pull/66 ``` $ warcio test test/data/*.warc.gz test/data/*.warc test/data/example-bad-non-chunked.warc.gz saw exception ERROR: non-chunked gzip file detected, gzip block continues beyond single record. This file...