Content-Type grammar inconsistent with examples
From WARC 1.1 section 5.6:
(or ‘application/http; msgtype=request’ and ‘application/http; msgtype=response’ respectively)
Note the space after the semicolon. However, the grammar immediately following this prose disallows whitespace in this position; it permits spaces only inside a parameter value that is enclosed in a quoted-string.
media-type = type "/" subtype *( ";" parameter )
type = token
subtype = token
[...]
token = 1*<any US-ASCII character except CTLs or separators>
separators = [...] | SP | HT
It appears the revised HTTP standards have addressed this problem, as the grammar in RFC 7231 explicitly allows optional whitespace in this position:
media-type = type "/" subtype *( OWS ";" OWS parameter )
Where OWS is defined in RFC 7230:
OWS = *( SP / HTAB )
; optional whitespace
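To illustrate the difference between the two grammars, here is a minimal Python sketch that encodes each as a regular expression. The names `STRICT`, `RELAXED` and `is_valid` are hypothetical. The `token` character class follows the `tchar` set from RFC 7230, and the parameter form (`token "=" ( token / quoted-string )`) is assumed from the HTTP RFCs, since the grammar quoted above elides it. The quoted-string pattern is simplified.

```python
import re

# token = 1*tchar (RFC 7230); assumed here to stand in for the WARC
# "any US-ASCII character except CTLs or separators" definition.
TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
# Simplified quoted-string: anything but DQUOTE/backslash, or a backslash escape.
QUOTED = r'"(?:[^"\\]|\\.)*"'
# parameter = token "=" ( token / quoted-string )  (assumed from the HTTP RFCs)
PARAM = rf"{TOKEN}=(?:{TOKEN}|{QUOTED})"

# media-type = type "/" subtype *( ";" parameter )
STRICT = re.compile(rf"{TOKEN}/{TOKEN}(?:;{PARAM})*")
# media-type = type "/" subtype *( OWS ";" OWS parameter )
RELAXED = re.compile(rf"{TOKEN}/{TOKEN}(?:[ \t]*;[ \t]*{PARAM})*")


def is_valid(value, relaxed=True):
    """Return True if the whole value matches the chosen grammar."""
    pattern = RELAXED if relaxed else STRICT
    return pattern.fullmatch(value) is not None
```

Under these assumptions, `is_valid("application/http; msgtype=request", relaxed=False)` is false, while the same value passes with `relaxed=True`. The form without the space passes both.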
Future revisions / errata of the WARC standard should make the same grammar correction.
Note that Heritrix writes the Content-Type header for HTTP requests and responses with the space, so a very large number of WARCs in the wild require this grammar change in order to be parsed successfully.
Another example from the wild: wget generates WARCs without the space.
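Since both forms occur in practice, a reader probably wants to accept either one. The following is a rough Python sketch of such a tolerant parser, not a reference implementation. `parse_content_type` is a hypothetical helper, the `token` class is taken from the `tchar` set in RFC 7230, and quoted-string handling is simplified.

```python
import re

# token = 1*tchar, per RFC 7230
TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
# Simplified quoted-string with backslash escapes
QUOTED = r'"(?:[^"\\]|\\.)*"'

MEDIA_TYPE = re.compile(rf"({TOKEN})/({TOKEN})")
# Optional whitespace is tolerated on both sides of the semicolon.
PARAM = re.compile(rf"[ \t]*;[ \t]*({TOKEN})=({TOKEN}|{QUOTED})")


def parse_content_type(value):
    """Split a Content-Type value into (media_type, params)."""
    m = MEDIA_TYPE.match(value)
    if m is None:
        raise ValueError(f"invalid media type: {value!r}")
    media_type = f"{m.group(1)}/{m.group(2)}".lower()
    params = {}
    pos = m.end()
    while pos < len(value):
        p = PARAM.match(value, pos)
        if p is None:
            raise ValueError(f"invalid parameter at offset {pos}: {value!r}")
        name, raw = p.group(1).lower(), p.group(2)
        if raw.startswith('"'):
            # Strip the surrounding quotes and undo backslash escapes.
            raw = re.sub(r"\\(.)", r"\1", raw[1:-1])
        params[name] = raw
        pos = p.end()
    return media_type, params
```

With this sketch, the Heritrix-style value `application/http; msgtype=request` and the wget-style value `application/http;msgtype=request` both yield the media type `application/http` with a `msgtype` parameter of `request`. Trailing whitespace after the last parameter is rejected, which matches the RFC 7231 grammar quoted above.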