Alex Osborne
* cdx command should be able to post records to a cdx server
* replay server should be able to use a cdx server as a record index
Lots of tricky details:
* How do we map URLs to file paths?
* What if a WARC contains several versions of the same URL?
* How do we handle...
Motivation:
* To allow the recording of messages using a different representation to their wire message format as
  - the wire protocol may be suboptimal for the purposes of storage...
A discussion between @ikreymer and @ibnesayeed discovered that in both WARC 1.0 and 1.1 the revisit record example uses message/http as the content-type whereas everywhere else in the standard application/http...
It's defined as:
    token = 1*<any US-ASCII character>
            except CTLs or separators>
But presumably the extra `>` character at the end of the first line shouldn't be there and the definition should...
From WARC 1.1 section 5.6:
> (or ‘application/http; msgtype=request’ and ‘application/http; msgtype=response’ respectively)

Note the space after the semicolon. However, the grammar immediately following this prose disallows spaces in this...
From section 10.6 "Example of ‘revisit’ record":
    HTTP/1.x 304 Not Modified
The string "HTTP/1.x" is an invalid HTTP-version per the grammar in RFC 7230:
    HTTP-version = HTTP-name "/" DIGIT "."...
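The mismatch can be checked mechanically. The following is a minimal sketch, assuming the RFC 7230 rule `HTTP-version = HTTP-name "/" DIGIT "." DIGIT` with the case-sensitive `HTTP-name` "HTTP"; the function name `is_valid_http_version` is hypothetical and chosen only for illustration.

```python
import re

# Assumed grammar (RFC 7230, section 2.6):
#   HTTP-version = HTTP-name "/" DIGIT "." DIGIT
#   HTTP-name    = "HTTP" (case-sensitive)
# [0-9] is used instead of \d so that only ASCII digits match.
HTTP_VERSION = re.compile(r"HTTP/[0-9]\.[0-9]")


def is_valid_http_version(value: str) -> bool:
    """Return True if value matches the HTTP-version rule exactly."""
    return HTTP_VERSION.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ("HTTP/1.1", "HTTP/1.x"):
        print(candidate, is_valid_http_version(candidate))
```

Under these assumptions the placeholder "HTTP/1.x" from the example is rejected because "x" is not a DIGIT, while "HTTP/1.0" and "HTTP/1.1" are accepted.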
In section 6.7.3 "Profile: Server Not Modified":
> To indicate this profile, use the URI:
> http://netpreserve.org/warc/1.1/revisit/server-not-modified

In section 10.6 "Example of ‘revisit’ record" the URI is missing the '/revisit/'...
WARC inherited line folding from HTTP, which presumably included it for compatibility with MIME messages, which have line length limits. The newer HTTP RFCs [deprecated it](https://datatracker.ietf.org/doc/html/rfc7230#section-3.2.4) and disallowed its use...
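For readers unfamiliar with line folding, here is a rough sketch of what a parser has to do to undo it. It assumes the RFC 7230 definition `obs-fold = CRLF 1*( SP / HTAB )` and the option that RFC gives recipients of replacing each fold with a single space; the function name `unfold_headers` and the sample header are hypothetical, and a real WARC parser may need to follow the WARC grammar's own rules rather than this one.

```python
import re

# Assumed definition (RFC 7230, section 3.2.4):
#   obs-fold = CRLF 1*( SP / HTAB )
OBS_FOLD = re.compile(r"\r\n[ \t]+")


def unfold_headers(raw: str) -> str:
    """Replace every folded line break with a single space."""
    return OBS_FOLD.sub(" ", raw)


if __name__ == "__main__":
    # Hypothetical header block with one folded field value.
    folded = "X-Example: first\r\n\tsecond\r\nContent-Length: 0\r\n"
    print(repr(unfold_headers(folded)))
```

A CRLF that is not followed by a space or tab is left alone, so ordinary field boundaries are preserved and only continuation lines are joined.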