archiveweb.page icon indicating copy to clipboard operation
archiveweb.page copied to clipboard

Incorrect handling of multiple headers

Open phiresky opened this issue 4 years ago • 2 comments

While implementing firefox support, I noticed the following:

The CDP Network.Headers interface returns headers as a JSON object of keys and values, while the webRequest API returns a list of keys and values.

HTTP Headers can appear multiple times though, so how does CDP handle multiple of the same header?

I tested this with a tiny test server, and when the response contains two values for the same header:

HTTP/1.1 200 OK
X-Powered-By: Express
Set-Cookie: foo=bar; Path=/; Expires=Sat, 06 Mar 2021 14:28:57 GMT
Set-Cookie: baz=hio; Path=/; Expires=Sat, 06 Mar 2021 14:28:57 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 12
ETag: W/"c-Lve95gjOVATpfV8EL5X4nxwjKHE"
Date: Sat, 06 Mar 2021 14:28:57 GMT
Connection: keep-alive
Keep-Alive: timeout=5


The following is stored in the database:

respHeaders:
Connection: "keep-alive"
Content-Length: "12"
Content-Type: "text/html; charset=utf-8"
Date: "Sat, 06 Mar 2021 14:28:57 GMT"
ETag: "W/"c-Lve95gjOVATpfV8EL5X4nxwjKHE""
Keep-Alive: "timeout=5"
Set-Cookie: "foo=bar; Path=/; Expires=Sat, 06 Mar 2021 14:28:57 GMT, baz=hio; Path=/; Expires=Sat, 06 Mar 2021 14:28:57 GMT"
X-Powered-By: "Express"

The headers are concatted simply with comma. This seems to be "correct" for most headers, but not all of them. For set-cookie for example, the cookies are mangled and not parseable. This means that the resulting web archive will set mangled cookies.

I found some more info here: https://stackoverflow.com/a/38406581/2639190

phiresky avatar Mar 06 '21 14:03 phiresky

Thanks for pointing this out! Probably should add a way to split these up when storing the headers. Cookies are a bit tricky, and actually are not supported being set from a service worker, so the Set-Cookie headers are not replayed. Instead, what happens is that the actual Cookie from the request is saved in another header, and then wabac.js injects that cookie into the page via document.cookie.

This seems sufficient for replay, and actually to be more accurate it should probably save the cookies from document.cookie itself, as only those cookies that are accessible to the document are needed for replay...

ikreymer avatar Mar 10 '21 00:03 ikreymer

actual Cookie from the request is saved in another header

Huh, did not see that. That makes sense. So for replay it doesn't really matter (assuming the Cookie header doesn't have the same problem with concatenation), though it does decrease the fidelity of the resulting web archive somewhat.

phiresky avatar Mar 10 '21 10:03 phiresky