
WARC-Protocol field proposal

Open ato opened this issue 7 years ago • 30 comments

Motivation:

  • To allow the recording of messages using a representation different from their wire format, as
    • the wire protocol may be suboptimal for the purposes of storage and replay; or
    • the raw bytes of the wire protocol may be unavailable.
    For example, it was proposed in #15 and #41 to allow HTTP/2 messages to be represented as application/http.
  • To allow the presence of layered protocols like TLS to be recorded.
  • To allow readers of WARC files to determine the protocol of a message without having to know how to parse the record block.
  • To disambiguate when the protocol cannot be determined from the message itself. Many protocols, including HTTP/2 and SPDY, negotiate protocol version up front and subsequent messages are not tagged with a protocol identifier.

WARC-Protocol field definition

The WARC-Protocol field denotes the protocol(s) of the original network message this record holds information about.

WARC-Protocol = "WARC-Protocol" ":" protocol-id
protocol-id = "dns"      ; DNS [RFC 1035]
            | "ftp"      ; FTP [RFC 959]
            | "gemini"   ; Gemini
            | "gopher"   ; Gopher [RFC 1436]
            | "http/0.9" ; HTTP/0.9
            | "http/1.0" ; HTTP/1.0 [RFC 1945]
            | "http/1.1" ; HTTP/1.1 [RFC 7230]
            | "h2"       ; HTTP/2 over TLS [RFC 7540]
            | "h2c"      ; HTTP/2 over cleartext TCP [RFC 7540]
            | "h3"       ; HTTP/3 [RFC 9114]
            | "quic/1"   ; QUIC version 1 [RFC 9000]
            | "quic/2"   ; QUIC version 2 [RFC 9369]
            | "spdy/1"   ; SPDY/1
            | "spdy/2"   ; SPDY/2
            | "spdy/3"   ; SPDY/3
            | "ssl/2"    ; SSLv2 aka SSL 0.2
            | "ssl/3"    ; SSLv3 aka SSL 3.0 [RFC 6101]
            | "tls/1.0"  ; TLS 1.0 [RFC 2246]
            | "tls/1.1"  ; TLS 1.1 [RFC 4346]
            | "tls/1.2"  ; TLS 1.2 [RFC 5246]
            | "tls/1.3"  ; TLS 1.3

If the protocol you wish to record is not on the list above please file an issue to propose a protocol identifier before using it.

The WARC-Protocol field may be omitted when the protocol is unknown or can be unambiguously determined from some combination of the scheme portion of the WARC-Target-URI field, the Content-Type field and the message in the record block itself.

Multiple WARC-Protocol fields may be present to indicate protocol layering. For example HTTP/1.1 over TLS 1.0 would be indicated by:

WARC-Protocol: http/1.1
WARC-Protocol: tls/1.0
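
A minimal sketch of how a reader might collect the layered protocols from a record header. The function name `protocol_layers` and the list-of-lines input are hypothetical, not part of the proposal. It assumes field names are matched case-insensitively and that the order of the repeated fields is preserved as written.

```python
def protocol_layers(header_lines):
    """Return WARC-Protocol values in the order they appear in the header."""
    layers = []
    for line in header_lines:
        name, sep, value = line.partition(":")
        # Skip lines without a colon; compare the field name case-insensitively.
        if sep and name.strip().lower() == "warc-protocol":
            layers.append(value.strip())
    return layers


headers = [
    "WARC-Type: response",
    "WARC-Protocol: http/1.1",
    "WARC-Protocol: tls/1.0",
]
print(protocol_layers(headers))  # ['http/1.1', 'tls/1.0']
```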

The WARC-Protocol field does not indicate the format of the record block and is not a replacement for the Content-Type field. For example, the use of an extended text format that includes HTTP/2 pseudo-headers should be indicated by a new value of the Content-Type field, not by the presence of WARC-Protocol: h2.

Different protocols may reuse the same media type. There are also situations where it may be desirable to represent the same message of a particular protocol using different types such as semantically equivalent text and binary forms.

The WARC-Protocol field may be used in 'request', 'response', 'resource', 'metadata' and 'revisit' records and shall not be used in 'warcinfo', 'conversion' and 'continuation' records.
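
As a rough illustration, a writer could check this rule before emitting the field. The helper name `may_carry_warc_protocol` is hypothetical, and treating record types not named in the paragraph above as "not allowed" is an assumption of this sketch rather than something the proposal states.

```python
# Record types named in the proposal text above.
ALLOWED_TYPES = frozenset({"request", "response", "resource", "metadata", "revisit"})
FORBIDDEN_TYPES = frozenset({"warcinfo", "conversion", "continuation"})


def may_carry_warc_protocol(warc_type):
    """Return True if a record of this WARC-Type may carry WARC-Protocol.

    Assumption: types that the proposal does not mention are rejected.
    """
    return warc_type in ALLOWED_TYPES


for warc_type in ("response", "revisit", "warcinfo"):
    print(warc_type, may_carry_warc_protocol(warc_type))
```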

Determining the protocol in the absence of WARC-Protocol

URI Scheme  Content-Type          Header version  Protocol
dns         text/dns                              dns       ; transport unknown
ftp                                               ftp       ; over cleartext TCP
gemini      application/gemini †                  gemini    ; over TLS #85
gopher      application/gopher †                  gopher    ; over cleartext TCP
http        application/http      absent          http/0.9  ; over cleartext TCP
http        application/http      "HTTP/1.0"      http/1.0  ; over cleartext TCP
http        application/http      "HTTP/1.1"      http/1.1  ; over cleartext TCP
https       application/http      "HTTP/1.0"      http/1.0  ; over TLS
https       application/http      "HTTP/1.1"      http/1.1  ; over TLS

† Not a registered media type but has been used in the wild.

When the WARC-Protocol field is present it takes precedence over the rules in the table above.
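
The fallback rules and the precedence rule can be sketched as a small lookup. This is only an illustration: the function name `infer_protocol` and its parameters are hypothetical, `header_version=None` stands for "absent", Content-Type parameters such as `msgtype=response` are assumed to be ignored, and the blank Content-Type cell for ftp is read as "any value".

```python
# Header version in the record block -> application protocol identifier.
HTTP_VERSIONS = {None: "http/0.9", "HTTP/1.0": "http/1.0", "HTTP/1.1": "http/1.1"}


def infer_protocol(scheme, content_type=None, header_version=None, warc_protocol=None):
    """Return the application protocol identifier, or None if undetermined."""
    if warc_protocol:
        # An explicit WARC-Protocol field takes precedence over the table.
        return warc_protocol
    scheme = scheme.lower()
    # Assumption: drop media type parameters before comparing.
    media_type = (content_type or "").split(";")[0].strip().lower()
    if scheme == "dns" and media_type == "text/dns":
        return "dns"
    if scheme == "ftp":
        return "ftp"
    if scheme in ("gemini", "gopher") and media_type == "application/" + scheme:
        return scheme
    if media_type == "application/http":
        if scheme == "http":
            return HTTP_VERSIONS.get(header_version)
        # The table has no https row with an absent header version.
        if scheme == "https" and header_version is not None:
            return HTTP_VERSIONS.get(header_version)
    return None


print(infer_protocol("http", "application/http; msgtype=response", "HTTP/1.1"))
print(infer_protocol("https", "application/http", "HTTP/1.1", warc_protocol="h2"))
```

Only the application protocol is returned here; the transport notes in the table (cleartext TCP or TLS) would need a separate return value.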

Edit 2023-05-31: Added 'revisit' to list of allowed records.
Edit 2023-06-01: Added Gemini protocol as proposed by @acidus99 in #85.
Edit 2023-06-02: Added Gopher protocol as proposed by @TheTechRobo in #87.
Edit 2024-07-15: Added h3 (HTTP/3).
Edit 2024-11-18: Added quic/1 and quic/2. Added clarifying example about pseudo-headers.

ato avatar Jul 13 '18 03:07 ato

What should we say about other protocols not in your list? Seems to me it is desirable to allow other values, but we also want to avoid a complete free-for-all. Maybe we could say, please file a github issue here to propose a new protocol id, before you use it. Then at least there is one place to check for prior art.

nlevitt avatar Jul 16 '18 18:07 nlevitt

Maybe we could say, please file a github issue here to propose a new protocol id, before you use it.

I think that's a great idea. I've updated the proposal text to include a link to an issue template.

ato avatar Jul 17 '18 01:07 ato

h2c and h2 are obvious odd ones out in the list as they don't follow the general name/version form and h2 vs h2c is somewhat redundant with specifying the TLS version. I did it that way for consistency with the identifiers the RFC itself says to use in the HTTP Upgrade header and the ALPN protocol identifier field.

Also I just made up the TLS protocol identifiers as I couldn't find anything semi-official. "SSLv3", "TLSv1.1" etc. seem somewhat common in software though (Java, OpenSSL) so I can see an argument that might be a better choice. I don't think there's a right answer here: the slash form is better in the sense that you could consistently chop the version off. The "TLSvX" form is better in that you might not have to convert from whatever name your TLS library reports. I couldn't see one argument as particularly more compelling than the other so just picked one.
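
To make the trade-off concrete, here is a rough sketch of both operations: converting a library-style name into the slash form, and chopping the version off a slash-form identifier. The helper names are hypothetical. The input strings follow the "SSLv3" / "TLSv1.2" style mentioned above (Python's `ssl.SSLSocket.version()` reports names of this shape, for instance); other libraries may use different spellings.

```python
import re


def tls_id_from_library_name(name):
    """Convert names like 'TLSv1.2' or 'SSLv3' to 'tls/1.2' or 'ssl/3'."""
    match = re.fullmatch(r"(SSL|TLS)v(\d+)(?:\.(\d+))?", name)
    if match is None:
        raise ValueError(f"unrecognised TLS version name: {name!r}")
    family, major, minor = match.groups()
    if family == "TLS":
        # 'TLSv1' has no minor part; the proposal spells it 'tls/1.0'.
        return f"tls/{major}.{minor or 0}"
    return f"ssl/{major}"


def split_protocol_id(protocol_id):
    """Split 'tls/1.2' into ('tls', '1.2'); 'h2' becomes ('h2', '')."""
    name, _, version = protocol_id.partition("/")
    return name, version


print(tls_id_from_library_name("TLSv1.2"))  # tls/1.2
print(split_protocol_id("tls/1.2"))         # ('tls', '1.2')
```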

ato avatar Jul 17 '18 02:07 ato

After proposing in #52 that WARC-Software-Version follow the format of HTTP User-Agent, I find myself thinking that WARC-Protocol, being also a list of version numbers, should be consistent with it. I keep going back and forth on it as I think there are arguments either way.

In favour of a single field in the style of User-Agent:

  • It makes WARC fields easier to deal with in most programming languages as you can just dump them into a hash table (with the exception of WARC-Concurrent-To).
  • I like the idea of using a consistent mini-language across all three headers (User-Agent, WARC-Software-Version, WARC-Protocol) for specifying component version numbers. It also leads to the obvious extension of allowing comments with more details for diagnostic/troubleshooting purposes.
  • It's more concise which makes records more human readable.

In favour of repeated fields:

  • It doesn't require field-specific parsing.
  • WARC does allow specific fields to be repeated so that's something readers have to account for anyway.
  • It's simpler to write a matching expression for generic filtering tools.

ato avatar Mar 06 '19 00:03 ato

I have a question about which record types the WARC-Protocol header, as well as the WARC-TLS-Cipher-Suite header mentioned/proposed by @ato here, should appear on.

  • Both a request and a response can travel on top of a TLS connection, so presumably these headers could appear on both the request and response records. But should they?
  • A client cannot change the TLS version or cipher suite between a request and a response, so the header values would be identical for request/response record pairs. Including it on both seems like needless duplication, especially if the records are linked with a WARC-Concurrent-To.

The most similar, already defined header I could think of to this is WARC-IP-Address. Section 5.10 of the 1.1 spec says "the numeric Internet address contacted to retrieve any included content" and can be associated with request and response records. But all the examples in the spec only show the WARC-IP-Address header on response records, and I haven't ever encountered any WARCs in the real world that use WARC-IP-Address on the request records.

(Which is kind of weird if you think about it from an order-of-operations perspective. The IP address of the system must be known before the request is made, so it's odd that the convention is to include the WARC-IP-Address header on response instead of the request.)

It feels like the WARC-Protocol and WARC-TLS-Cipher-Suite headers should go where the WARC-IP-Address header goes, but I really am curious about the community's feedback.

acidus99 avatar May 30 '23 21:05 acidus99

I haven't ever encountered any WARCs in the real world that use WARC-IP-Address on the request records.

Here are some tools that do: wget, wpull, qwarc, Zeno, warcio (at least when using warcio.capture_http). I'm sure there are more. Heritrix and warcprox don't. If you want some real-world example WARCs, the ArchiveTeam collection on the Internet Archive is full of them.

I think that they should be allowed on both request and response records. As for why you might want to record it on the request record: consider the case where you send a request but never receive a response. It is still worth recording this attempted request (and note the lack of a response in the log accompanying the crawl), including the relevant details like IP and protocol.

JustAnotherArchivist avatar May 30 '23 23:05 JustAnotherArchivist

I just realized I missed the 'revisit' record type in the WARC-Protocol proposal, so have edited it to be included. After this edit WARC-Protocol is allowed on the same record types as WARC-IP-Address (‘response’, ‘resource’, ‘request’, ‘metadata’, and ‘revisit’).

Some reasons for allowing it on multiple record types:

  • In some cases the request and response may use different protocol versions. (e.g. http/1.0 vs http/1.1)
  • You may have information about the protocol that was used but not have the actual request or response message. This can occur for example when converting to WARC from another format or due to tool limitations (e.g. in-browser archiving).

it's odd that the convention is to include the WARC-IP-Address header on response instead of the request

It's likely because:

  1. The older ARC file format did not store the request but did store the IP address.
  2. Before the advent of browser-based crawling, request records were usually completely ignored and not indexed for replay. So if you're going to put it in just one record then choosing the response record would make it more easily accessible to replay tools.

ato avatar May 31 '23 01:05 ato

Excellent, thanks for the context. I ended up including them on both request and response records.

acidus99 avatar May 31 '23 13:05 acidus99

After proposing in #52 that WARC-Software-Version follow the format of HTTP User-Agent, I find myself thinking that WARC-Protocol, being also a list of version numbers, should be consistent with it. I keep going back and forth on it as I think there are arguments either way.

It seems like this hasn't been decided one way or another, but I would very much be in favor of a single field, as that makes representing WARC headers as a dictionary object much easier and more concise. Are there other WARC headers that allow repetition currently?

The repeatable Set-Cookie and Link HTTP headers require special parsing, but also have custom semantics that make sense to keep separate. As this is a much simpler header, I think a comma-separated value list makes a lot of sense, in line with other headers like Accept*, Vary, etc.
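
A small sketch of the difference, assuming headers arrive as (name, value) pairs; the comma-separated syntax here is only the form suggested in this comment, not something the proposal currently defines, and the helper name `layers` is hypothetical.

```python
# The same layering expressed as repeated fields and as one comma-separated field.
repeated = [("WARC-Protocol", "http/1.1"), ("WARC-Protocol", "tls/1.0")]
single = [("WARC-Protocol", "http/1.1, tls/1.0")]


def layers(fields):
    """Collect protocol identifiers from either representation."""
    found = []
    for name, value in fields:
        if name.lower() == "warc-protocol":
            found.extend(part.strip() for part in value.split(",") if part.strip())
    return found


# Dumping repeated fields into a plain dict silently keeps only the last value.
print(dict(repeated))  # {'WARC-Protocol': 'tls/1.0'}
print(dict(single))    # {'WARC-Protocol': 'http/1.1, tls/1.0'}
print(layers(repeated) == layers(single))  # True
```

A reader that accepts both forms, as `layers` does, costs little, but the dictionary case shows why the single-field form is easier for tools that model headers as a mapping.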

ikreymer avatar Jul 12 '24 05:07 ikreymer

Are there other WARC headers that allow repetition currently?

WARC-Concurrent-To is the only one in the standard:

As an exception to the general rule, several WARC-Concurrent-To fields may be repeated within the same WARC record.

The only other standard headers that would seem to make sense to repeat are the payload/block digest headers for different algorithms. But that's not allowed currently.

Repetition of extension headers was also discussed in #95. I haven't seen any other extension headers in the wild that use repetition or comma separated lists so far.

It's not WARC record headers, but Heritrix uses repeated fields in application/warc-fields metadata records to record extracted links.

ato avatar Jul 13 '24 16:07 ato

I'm in favor of a single field, comma-separated.

Note that the clock has pretty much ticked out on this discussion... the minute that a large web player starts discriminating against crawling with http/1.1 and less so against crawling with http/2, we have to switch immediately.

wumpus avatar Jul 16 '24 01:07 wumpus