warc-specifications

Content-Type grammar inconsistent with examples

Open ato opened this issue 7 years ago • 1 comments

From WARC 1.1 section 5.6:

(or ‘application/http; msgtype=request’ and ‘application/http; msgtype=response’ respectively)

Note the space after the semicolon. However, the grammar immediately following this prose disallows whitespace in this position; it permits whitespace only inside a parameter value enclosed in a quoted-string.

media-type    = type "/" subtype *( ";" parameter )
type          = token
subtype       = token
[...]
token         = 1*<any US-ASCII character
                except CTLs or separators>
separators    = [...] | SP | HT
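
To make the mismatch concrete, here is a minimal sketch of the grammar above as a regular expression. It is an illustration only, not code from any WARC library: the names `TOKEN`, `QUOTED`, `STRICT_MEDIA_TYPE` and `matches_strict` are hypothetical, the quoted-string pattern is simplified, and the token character set assumes the usual HTTP separator list.

```python
import re

# token: one or more US-ASCII characters, excluding CTLs and separators
# (assumed separator list: ( ) < > @ , ; : \ " / [ ] ? = { } SP HT).
TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"

# quoted-string, simplified: any characters between double quotes,
# with backslash-escaped characters allowed.
QUOTED = r'"(?:[^"\\]|\\.)*"'

# parameter = attribute "=" value, where value is a token or a quoted-string.
PARAMETER = rf"{TOKEN}=(?:{TOKEN}|{QUOTED})"

# media-type = type "/" subtype *( ";" parameter )
# No whitespace is permitted around the ";" in this form of the grammar.
STRICT_MEDIA_TYPE = re.compile(rf"{TOKEN}/{TOKEN}(?:;{PARAMETER})*")


def matches_strict(value: str) -> bool:
    """Return True if the whole value matches the strict grammar sketched above."""
    return STRICT_MEDIA_TYPE.fullmatch(value) is not None


print(matches_strict("application/http;msgtype=request"))   # True
print(matches_strict("application/http; msgtype=request"))  # False
```

Under these assumptions, the form with a space after the semicolon fails to match, even though it is the form used in the section 5.6 prose.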

Revised HTTP standards appear to have addressed this problem, as the grammar in RFC 7231 explicitly allows optional whitespace in this position:

media-type = type "/" subtype *( OWS ";" OWS parameter )

Where OWS is defined in RFC 7230:

     OWS            = *( SP / HTAB )
                    ; optional whitespace

Future revisions / errata of the WARC standard should make the same grammar correction.

Note that Heritrix writes the Content-Type header for HTTP requests and responses with the space, so a very large number of WARCs in the wild require this grammar change in order to be parsed successfully.
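
For reader implementations, a tolerant parser along the lines of the RFC 7231 grammar could look roughly like the sketch below. This is a hedged illustration, not the behaviour of any particular WARC tool: `parse_media_type` and the pattern names are hypothetical, and the quoted-string handling is simplified.

```python
import re

# Token characters: US-ASCII excluding CTLs and separators.
TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
# Simplified quoted-string with backslash escapes.
QUOTED = r'"(?:[^"\\]|\\.)*"'
# OWS = *( SP / HTAB )
OWS = r"[ \t]*"

TYPE_RE = re.compile(rf"({TOKEN})/({TOKEN})")
# One "OWS ; OWS parameter" group, as in the RFC 7231 media-type rule.
PARAM_RE = re.compile(rf"{OWS};{OWS}({TOKEN})=({TOKEN}|{QUOTED})")


def parse_media_type(value: str):
    """Parse a media type, accepting optional whitespace around ';'.

    Returns (lowercased "type/subtype", dict of lowercased parameter names).
    Raises ValueError if the value does not match the lenient grammar.
    """
    m = TYPE_RE.match(value)
    if m is None:
        raise ValueError(f"not a media type: {value!r}")
    media_type = f"{m.group(1)}/{m.group(2)}".lower()

    params = {}
    pos = m.end()
    while pos < len(value):
        p = PARAM_RE.match(value, pos)
        if p is None:
            raise ValueError(f"bad parameter at offset {pos}: {value!r}")
        name, raw = p.group(1).lower(), p.group(2)
        if raw.startswith('"'):
            # Strip the surrounding quotes and undo backslash escapes.
            raw = re.sub(r"\\(.)", r"\1", raw[1:-1])
        params[name] = raw
        pos = p.end()
    return media_type, params


# Both the spaced and the unspaced form parse to the same result.
print(parse_media_type("application/http; msgtype=request"))
print(parse_media_type("application/http;msgtype=request"))
```

Accepting both spellings on read costs little, whichever form a writer chooses to emit.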

ato, Jul 09 '18 01:07

Another example from the wild: wget generates WARCs without the space.

wumpus, Feb 01 '19 18:02