
Does ipwb handle segmented response records?

Open machawk1 opened this issue 7 years ago • 5 comments

The WARC/1.1 spec (Section B.8) gives an example where a response record is segmented into multiple smaller records. Segmentation changes the hash digests in two places: the WARC-Block-Digest and WARC-Payload-Digest fields of the response and continuation records, and the multihash in ipwb, which likely calculates it from the content of the initial response record alone and does not consider the other segments.
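To illustrate the concern, here is a minimal sketch using only the Python standard library. The payload and the split point are made up, and SHA-1 in base32 stands in for whatever digest or multihash a tool actually computes; this is not ipwb's code. It shows that hashing only the first segment yields a different value than hashing the reassembled content.

```python
import base64
import hashlib


def warc_sha1_digest(data: bytes) -> str:
    """Format a digest in the 'sha1:<base32>' style commonly seen in WARC files."""
    return "sha1:" + base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


# Hypothetical response block, split in two as a segmented record would be.
full_block = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\n" + b"A" * 64
first_segment, continuation_segment = full_block[:40], full_block[40:]

# Digest a reader gets if it only looks at the initial response record.
digest_first_only = warc_sha1_digest(first_segment)

# Digest of the content after joining the segments in order.
digest_reassembled = warc_sha1_digest(first_segment + continuation_segment)

print(digest_first_only == digest_reassembled)              # False
print(digest_reassembled == warc_sha1_digest(full_block))   # True
```

The same mismatch would apply to any hash computed over the first record's content only, which is the behavior this issue asks to verify.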

Let's check the implementation of the module we use to extract the response records, using some dummy data that includes a continuation record and a WARC-Segment-Number field in the initial (and potentially subsequent) records.
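As a rough sketch of what such a check could compare against, the snippet below reassembles dummy segments by their WARC-Segment-Number. The `(headers, block)` tuple representation, the record IDs, and the `reassemble_segments` helper are all hypothetical and do not reflect the output of any particular WARC parsing library; only the WARC header field names come from the spec.

```python
from collections import defaultdict


def reassemble_segments(records):
    """Join segmented record blocks in WARC-Segment-Number order.

    `records` is a hypothetical list of (headers_dict, block_bytes) tuples.
    """
    segments = defaultdict(dict)
    for headers, block in records:
        number = headers.get("WARC-Segment-Number")
        if number is None:
            continue  # record is not part of a segmented series
        # The first segment is keyed by its own record ID; continuation
        # records point back to it through WARC-Segment-Origin-ID.
        origin = headers.get("WARC-Segment-Origin-ID", headers["WARC-Record-ID"])
        segments[origin][int(number)] = block

    reassembled = {}
    for origin, parts in segments.items():
        expected = list(range(1, len(parts) + 1))
        if sorted(parts) != expected:
            raise ValueError(f"missing segment for {origin}")
        reassembled[origin] = b"".join(parts[n] for n in expected)
    return reassembled


# Dummy data: an initial response record plus one continuation record.
dummy_records = [
    ({"WARC-Type": "response",
      "WARC-Record-ID": "<urn:uuid:origin>",
      "WARC-Segment-Number": "1"}, b"first half, "),
    ({"WARC-Type": "continuation",
      "WARC-Record-ID": "<urn:uuid:cont>",
      "WARC-Segment-Origin-ID": "<urn:uuid:origin>",
      "WARC-Segment-Number": "2",
      "WARC-Segment-Total-Length": "23"}, b"second half"),
]

result = reassemble_segments(dummy_records)
print(result["<urn:uuid:origin>"])
```

A test along these lines would then compare the hash ipwb stores for the record against the hash of the reassembled bytes.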

Ideally, it would be useful to have a data set of WARCs exercising all of the features, a la a set of minimal working examples, but I have yet to come across such a data set. The key here would be MINIMAL examples without the cruft that may trip up other processes and produce true/false positives/negatives.

machawk1 avatar Jan 30 '18 01:01 machawk1

Since we are relying on PyWB for WARC parsing, we can offload this responsibility there. In fact I would much prefer to move to the new warcio library for WARC parsing.

/cc @ikreymer

ibnesayeed avatar Jan 30 '18 16:01 ibnesayeed

@ibnesayeed Offloading the responsibility is one thing; verifying whether ipwb currently does the right thing is another. This ticket is about verifying the correctness in ipwb.

As an aside, I believe there was an effort to move to warcio at one point, but something about the difference in the iteration approach used kept that from moving forward.

machawk1 avatar Jan 30 '18 17:01 machawk1

I thought the hiccup was due to Python version, but I might be wrong.

ibnesayeed avatar Jan 30 '18 21:01 ibnesayeed

Yeah, you should use warcio directly for reading the WARC; the latest pywb just uses warcio as well.

ikreymer avatar Jan 30 '18 21:01 ikreymer

@ibnesayeed warcio reports being compatible with Python 2, so this might not have been the issue. Hopefully that will be moot when we finish #51. We discussed utilizing parts of warcio in #129 and #211.

@ikreymer Can you report on how warcio handles continuation record(s) chained from a response record?

machawk1 avatar Jan 30 '18 21:01 machawk1