warc-specifications
                                
                        HTTP/2: Server push proposal (WARC-Push-Promised-From)
HTTP/2 introduced a new server push mechanism: instead of waiting for a client to send a request, the server can anticipate a request the client is expected to make and 'push' a response pre-emptively. When it does this, the server actually sends the client an HTTP request:
Server push is semantically equivalent to a server responding to a request; however, in this case, that request is also sent by the server.
[Footnote on terminology: The HTTP/2 specification seems to use the terms 'pushed' and 'promised' semi-interchangeably, and it's not really clear to me whether there's a distinction in meaning. The binary message frame containing the request is called PUSH_PROMISE. It could be that 'push' refers more to a pushed resource and 'promise' refers to the opening of a new stream. I'm really not sure.]
A typical exchange might look like this:
Client> stream=1 [normal request] GET /
Server> stream=1 [push_promise(stream=2) request] GET /cat.jpg
Server> stream=1 [push_promise(stream=3) request] GET /style.css
Server> stream=1 [response] 200 text/html
Server> stream=2 [response] 200 image/jpeg
Server> stream=3 [response] 200 text/css
Note that HTTP/2 is multiplexed, so all three response messages may be sent in parallel. The push promised requests, however, must be sent before any response to avoid a race in which the client requests a resource the server is about to push.
There are only 3 differences between a push promised request and a normal request:
- Push promised requests are sent by the server instead of the client.
- Push promised requests are associated with a prior client request.
- Push promised requests are not allowed to have a request body.
Push promised responses are identical to normal responses and aren't a new type of message at the protocol level.
So how do we record this situation in a WARC file?
It seems clear that the response record does not need any special handling as it's intended to be treated just like a regular response and may populate caches and so forth as normal.
The push promised request, however, seems to need new handling of some sort. At first I was certain we should define a new record type for these requests, as they seem to be a new type of thing. In a discussion on Slack @ikreymer suggested it might actually be preferable to reuse the existing request record here too. The argument he made was that existing tools like Pywb that look up the request in order to retrieve information like the request method would just work.
I guess the question hinges on which way is better for compatibility. I haven't been able to come up with a concrete scenario where interpreting a push request as a normal request would break existing tools, so I'm now finding myself swayed by Ilya's argument. Note that the WARC spec doesn't actually say that a request must go from client to server, so reusing the record type doesn't seem to cause any wording inconsistencies.
During the Slack discussion we worked out a model for how things might work if promised requests were recorded in regular WARC 'request' records.
The model
The WARC-Concurrent-To field would continue to be the way to associate requests with responses, even for the new records. However, to avoid confusion it is essential that WARC-Concurrent-To is not used to link the normal request and response to any subsequent pushed ones.
So we introduce a new field, WARC-Push-Promised-From, on the pushed request, whose value is the record id of the record holding the original request the client sent the server. The presence of this field also serves as the way to identify that the request is a push promised request rather than a normal one.
The example exchange I outlined above would result in six records linked together like so:
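The linkage can be sketched roughly in Python. This is only an illustration of the model, not a WARC writer: the helper names (`new_record_id`, `make_exchange`), the `example.org` base URL and the use of `urn:uuid` record IDs are assumptions, as is putting WARC-Concurrent-To only on the request side of each pair.

```python
import uuid


def new_record_id():
    # Assumption: urn:uuid record IDs wrapped in angle brackets.
    return f"<urn:uuid:{uuid.uuid4()}>"


def make_exchange(base="https://example.org"):
    """Build header dicts for the six records of the example exchange."""

    def record(rec_type, path):
        return {
            "WARC-Type": rec_type,
            "WARC-Record-ID": new_record_id(),
            "WARC-Target-URI": base + path,
        }

    def pair(path):
        # A request is linked via WARC-Concurrent-To to its own response only.
        request = record("request", path)
        response = record("response", path)
        request["WARC-Concurrent-To"] = response["WARC-Record-ID"]
        return request, response

    client_req, client_resp = pair("/")
    records = [client_req, client_resp]

    for path in ("/cat.jpg", "/style.css"):
        promised_req, pushed_resp = pair(path)
        # The proposed field points at the original client request.
        promised_req["WARC-Push-Promised-From"] = client_req["WARC-Record-ID"]
        records += [promised_req, pushed_resp]
    return records


for rec in make_exchange():
    print(rec["WARC-Type"], rec["WARC-Target-URI"],
          "pushed" if "WARC-Push-Promised-From" in rec else "")
```

Each of the three request/response pairs stays self-contained, and the only thing tying the pushed pairs back to the client request is the new field.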

WARC-Push-Promised-From field definition
General
The WARC-Push-Promised-From field indicates that this request was sent by the server as a push promised request. The value of the field is the WARC-Record-ID of the 'request' record holding the originating client request message.
WARC-Push-Promised-From = "WARC-Push-Promised-From" ":" "<" uri ">"
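As a rough sketch of how a tool might read and write this value, assuming the field value is a single URI in angle brackets with no whitespace inside it (the function names and the regular expression are illustrative, not part of the proposal):

```python
import re

# Assumption: one URI, wrapped in "<" and ">", with no brackets or spaces inside.
_FIELD_VALUE = re.compile(r"^<([^<>\s]+)>$")


def parse_push_promised_from(value):
    """Return the bare record URI, or raise ValueError if the value is malformed."""
    match = _FIELD_VALUE.match(value.strip())
    if match is None:
        raise ValueError(f"malformed WARC-Push-Promised-From value: {value!r}")
    return match.group(1)


def format_push_promised_from(record_uri):
    """Serialise the header line for a bare record URI."""
    return f"WARC-Push-Promised-From: <{record_uri}>"
```

For example, `parse_push_promised_from("<urn:uuid:1234>")` yields the bare `urn:uuid:1234`, and a value without angle brackets is rejected.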
The promised 'request' and its corresponding pushed 'response' or 'revisit' record should be linked to each other via the WARC-Concurrent-To header but, to avoid ambiguity, must not be linked via WARC-Concurrent-To to the original client 'request' or its corresponding 'response', 'revisit' or 'metadata' records.
The WARC-Push-Promised-From field may be used in 'request' records and shall not be used in 'warcinfo', 'response', 'metadata', 'conversion', 'resource', 'revisit' or 'continuation' records.
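These constraints could be checked mechanically. The following is a minimal sketch, assuming records are available as plain header dicts and that each record carries at most one WARC-Concurrent-To value (real WARC records may carry several); `check_push_rules` is a hypothetical helper, not something an existing tool provides.

```python
def check_push_rules(records):
    """Return a list of human-readable problems found in a list of header dicts."""
    by_id = {r["WARC-Record-ID"]: r for r in records}
    problems = []
    for rec in records:
        origin_id = rec.get("WARC-Push-Promised-From")
        if origin_id is None:
            continue
        rec_id = rec["WARC-Record-ID"]

        # The field may only appear on 'request' records.
        if rec["WARC-Type"] != "request":
            problems.append(f"{rec_id}: field used on a {rec['WARC-Type']} record")

        # The field must point at the originating client 'request' record.
        origin = by_id.get(origin_id)
        if origin is not None and origin["WARC-Type"] != "request":
            problems.append(f"{rec_id}: target is not a request record")

        # No concurrent-to link into the original client request's group.
        forbidden = {origin_id}
        if origin is not None and "WARC-Concurrent-To" in origin:
            forbidden.add(origin["WARC-Concurrent-To"])
        forbidden.update(
            r["WARC-Record-ID"] for r in records
            if r.get("WARC-Concurrent-To") == origin_id
        )
        if rec.get("WARC-Concurrent-To") in forbidden:
            problems.append(f"{rec_id}: concurrent-to links to the client exchange")
    return problems
```

An empty list means none of the three checks fired; it does not prove the records are otherwise valid.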
HTTP/2 protocol
The WARC-Target-URI field of a push promised 'request' record should be derived from the :scheme, :authority and :path pseudo-header fields. Note that RFC 7540 requires that HTTP/2 clients reject push promises when the server does not have authority over the target URI. Programs writing WARC records should ensure their HTTP/2 client implements this requirement in order to prevent spoofed URIs being recorded.
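A small sketch of that derivation, assuming the pseudo-header fields of the PUSH_PROMISE have already been decoded into a dict by whatever HTTP/2 library is in use (the function name is made up for illustration, and the authority check itself is left to the HTTP/2 client):

```python
def target_uri_from_pseudo_headers(headers):
    """Build a WARC-Target-URI from HTTP/2 request pseudo-header fields."""
    try:
        scheme = headers[":scheme"]
        authority = headers[":authority"]
        path = headers[":path"]  # includes the query string, if any
    except KeyError as missing:
        raise ValueError(f"push promise lacks pseudo-header {missing}") from None
    return f"{scheme}://{authority}{path}"


promise = {
    ":method": "GET",
    ":scheme": "https",
    ":authority": "example.org",
    ":path": "/cat.jpg",
}
print(target_uri_from_pseudo_headers(promise))
```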
In the HTTP/2 protocol a push promised request cannot carry a payload.
Questions
Should WARC-Push-Promised-From point to a request or to a response?
There is a practical reason to favour linking pushes to the original client request:
Pushed resources may be fully received before the header of the client-requested response is even sent. The client-requested response may in fact never arrive (because of errors, time limits, a crawler configured to save only files of a certain content type, etc.). You might still want to save the pushed resources anyway, but if we link to the response there's no way to do that.
The downside is that using the information at replay time might require more lookups, although unless sites start abusing the mechanism we shouldn't need to replay pushes. You can, and probably should, build a more clever index that pre-associates everything.
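One possible shape for such an index, sketched under the assumption that record headers are available as plain dicts; `build_push_index` and the choice to key by the client request's record ID are illustrative only:

```python
from collections import defaultdict


def build_push_index(records):
    """Map each client request's record ID to the target URIs pushed from it."""
    index = defaultdict(list)
    for rec in records:
        origin_id = rec.get("WARC-Push-Promised-From")
        if origin_id is not None:
            index[origin_id].append(rec["WARC-Target-URI"])
    return dict(index)
```

Built once at indexing time, this turns the replay-time question "what was pushed alongside this request?" into a single dictionary lookup.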
Are push requests always in response to a normal client request? Or can you have a freestanding push request?
RFC 7540 says:
Pushed responses are always associated with an explicit request from the client.
Can a server push a URI it doesn't own?
Yes, but the client is required to reject it.
Do we need to record rejected push requests?
Open question.
So the issues Ilya raised in #4 about the lack of record-id indexes apply equally here. Should we define -Target-URI and -Date variants of every new field that could point at an existing record?
Server push is being removed from Chromium: https://groups.google.com/a/chromium.org/g/blink-dev/c/K3rYLvmQUBY/m/vOWBKZGoAQAJ?pli=1
"Almost five and a half years after the publication of the HTTP/2 RFC, server push is still extremely rarely used." "Server push is very difficult to use well." "There is significant code complexity associated with Chromium supporting push. [...] We believe that [this complexity] outweighs the theoretical performance benefits."
Thank (DEITY|EXPLETIVE) for that.