
Meaning of "<any OCTET except CTLs, but including LWS>"

Open o018BUm8UQEEY2e5 opened this issue 10 months ago • 3 comments

I'm confused by this rule in the ABNF provided in The WARC Format 1.1:

TEXT          = <any OCTET except CTLs,
                but including LWS>

Which of these (if any) is the correct interpretation:

TEXT          = %x20-7E | %x80-FF | LWS 
TEXT          = %x20-7E | %x80-FF | SP | HT
TEXT          = %x20-7E | %x80-FF | CR | LF | SP | HT

o018BUm8UQEEY2e5, Feb 12 '25 18:02

The first one. A CRLF can appear only as part of LWS, that is, only when it is immediately followed by at least one SP or HT. This is called line folding. The definition was inherited from HTTP/1.1 (RFC 2616), where LWS is defined as [CRLF] 1*( SP | HT ), so you may find the explanatory text in section 2.2 of that RFC helpful.
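To make that interpretation concrete, here is a minimal Python sketch of a checker for a run of TEXT octets. The names `TEXT_RUN` and `is_text` are made up for illustration and do not come from the WARC specification or any WARC library. It assumes the RFC 2616 definition of LWS and rewrites the rule in an equivalent, unambiguous form: a visible or high octet, a bare HT, or a CRLF immediately followed by SP or HT.

```python
import re

# Assumption: LWS = [CRLF] 1*( SP | HT ), as in RFC 2616 section 2.2.
# A run of TEXT (%x20-7E | %x80-FF | LWS) is then equivalent to a run of:
#   - any octet in %x20-7E or %x80-FF (SP is already inside %x20-7E),
#   - a bare HT (LWS without the optional CRLF), or
#   - CRLF immediately followed by SP or HT (a line fold).
# Writing it this way avoids overlapping alternatives in the regex.
TEXT_RUN = re.compile(rb"(?:[\t\x20-\x7e\x80-\xff]|\r\n[ \t])*")


def is_text(value: bytes) -> bool:
    """Return True if value consists entirely of TEXT octets (hypothetical helper)."""
    return TEXT_RUN.fullmatch(value) is not None
```

Under these assumptions, `is_text(b"a\r\n b")` accepts the folded value, while `is_text(b"a\r\nb")` rejects it because the CRLF is not followed by SP or HT.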

Note that while the WARC standard allows them, in practice line folding and non-UTF-8 encodings are not well supported, so I recommend that WARC writers avoid using them. Both features were also deprecated in the newer HTTP/1.1 specification, RFC 7230.

ato, Feb 12 '25 22:02

But compliant parsers should still support it?

o018BUm8UQEEY2e5, Feb 12 '25 22:02

Yes. I haven't seen it used in real WARC files in the wild, but a fully compliant parser should support it.

From what I've seen, many (but not all) parsers support line folding, but they vary in how they expose a folded value as a string in their header-reading API. Some include the LWS sequence as is; others replace it with a single space or linefeed. I haven't seen any parser that supports the non-UTF-8 'encoded-word' feature, though.
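The differences described above can be sketched in Python as three policies for exposing a folded header value. This is only an illustration: `read_value` and the policy names are hypothetical and are not taken from any existing WARC parser. It treats a fold as a CRLF followed by one or more SP or HT octets.

```python
import re

# Assumption: a fold is CRLF followed by one or more SP/HT octets.
FOLD = re.compile(rb"\r\n[ \t]+")


def read_value(raw: bytes, policy: str = "space") -> bytes:
    """Return a header value with folds handled per policy (hypothetical API)."""
    if policy == "keep":
        # Expose the LWS sequence exactly as it appears in the file.
        return raw
    if policy == "space":
        # Collapse each fold into a single space.
        return FOLD.sub(b" ", raw)
    if policy == "linefeed":
        # Collapse each fold into a single linefeed.
        return FOLD.sub(b"\n", raw)
    raise ValueError(f"unknown policy: {policy!r}")
```

For example, the folded value `b"first\r\n  second"` stays unchanged under `"keep"`, becomes `b"first second"` under `"space"`, and becomes `b"first\nsecond"` under `"linefeed"`. Code that compares header values across parsers may need to normalize for this.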

ato, Feb 12 '25 22:02