Meaning of "<any OCTET except CTLs, but including LWS>"
I'm confused by this rule in the ABNF provided in The WARC Format 1.1:
TEXT = <any OCTET except CTLs, but including LWS>
Which of these (if any) is the correct interpretation:
TEXT = %x20-7E | %x80-FF | LWS
TEXT = %x20-7E | %x80-FF | SP | HT
TEXT = %x20-7E | %x80-FF | CR | LF | SP | HT
The first one. LWS is defined as [CRLF] 1*( SP | HT ), so a CRLF can appear only if it is immediately followed by SP or HT. This is called line folding. The definition was inherited from the HTTP/1.1 specification, RFC 2616, so you may find the explanatory text in section 2.2 of that RFC helpful.
Note that while the WARC standard allows line folding and non-UTF-8 encodings, in practice neither is well supported, so I recommend that WARC writers avoid them. Both features were also deprecated in the newer HTTP specification, RFC 7230.
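To make the first interpretation concrete, here is a minimal Python sketch of a validator for a run of TEXT octets. The names TEXT_RUN and is_text are made up for illustration and do not come from the WARC standard or any WARC library. The pattern is written as "any non-CTL octet, or HT, or CRLF followed by SP/HT", which I believe accepts the same strings as %x20-7E | %x80-FF | LWS repeated zero or more times.

```python
import re

# Assumption: this checks *TEXT (zero or more TEXT elements) under the first
# interpretation. LWS = [CRLF] 1*( SP | HT ), so the only place a CRLF may
# appear is directly before a SP or HT. HT is otherwise a CTL, but it is
# allowed because it can occur as part of LWS.
TEXT_RUN = re.compile(rb"(?:[\x20-\x7e\x80-\xff\t]|\r\n[ \t])*")

def is_text(value: bytes) -> bool:
    """Return True if value consists only of TEXT octets and valid folds."""
    return TEXT_RUN.fullmatch(value) is not None
```

With this sketch, b"foo\r\n bar" is accepted because the CRLF is followed by a space, while b"foo\r\nbar" and b"foo\nbar" are rejected.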
But compliant parsers should still support it?
Yes. I haven't seen it used in real WARC files in the wild, but a fully compliant parser should support it.
From what I've seen, many (but not all) parsers support line folding, but they vary in how they expose a folded value as a string in their header-reading API. Some include the LWS sequence as is; others replace it with a single space or linefeed. I haven't seen any parser that supports the non-UTF-8 'encoded-word' feature, though.
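The different behaviours described above can be sketched in a few lines of Python. This is only an illustration, not how any particular parser does it: the unfold function and its replacement parameter are hypothetical names, and the input is assumed to be the raw bytes of a single header value that has already been split from its field name.

```python
import re
from typing import Optional

# A fold is the CRLF form of LWS: CRLF followed by one or more SP/HT.
FOLD = re.compile(rb"\r\n[ \t]+")

def unfold(raw_value: bytes, replacement: Optional[bytes] = b" ") -> bytes:
    """Return raw_value with each fold replaced by `replacement`.

    replacement=None keeps the LWS sequence as is, b" " collapses each fold
    to a single space, and b"\n" collapses it to a single linefeed.
    """
    if replacement is None:
        return raw_value
    # A function is used so that `replacement` is inserted literally.
    return FOLD.sub(lambda match: replacement, raw_value)
```

For example, unfold(b"a\r\n  b") gives b"a b", and unfold(b"a\r\n  b", None) returns the value unchanged. Which behaviour is right for a given tool is a design choice; the point is only that readers of the same file may see different strings.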