Cookie name/value serialization
Since cookie values, at least, are byte sequences, it needs to be specified how these are converted to a string and vice versa.
Oh no, I guess we don't handle this at all. It would be really useful to know what happens in implementations today if you make a cookie with a value that isn't UTF-8-serializable.
That only works in Chrome at the moment, but we are considering aligning Firefox with that behavior (Fx is limited to UTF-8). Safari only supports ASCII I think, but I did not test exhaustively. (Those cookies in Chrome are not serialized for document.cookie.)
From a quick conversation with @annevk, we think it makes sense to transfer the raw bytes from the browser without decoding them to UTF-8 or similar in between. To do that we could pass a number array or a ByteString-like format.
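For illustration, the same value octets could travel in either of those two shapes (nothing here is specified anywhere; the names are just placeholders):

```js
// The raw octets of the cookie value "foo", shown in the two candidate shapes
// discussed above. Names are illustrative only.
const rawBytes = [0x66, 0x6f, 0x6f];

// 1. A plain number array: one integer per octet.
const asNumberArray = rawBytes;

// 2. A ByteString-style string: one code point (U+0000-U+00FF) per octet.
const asByteString = String.fromCharCode(...rawBytes); // "foo"
```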
Perhaps an uninformed take (please correct me if so), but from what I understand one of the sticking points is picking a useful representation of the cookie's octets.
I like the idea of a byte string, and it may in fact be necessary to support non-printable cookie octets, but that's not useful to the human on the other end. What if, in addition to the byte field, there was also a human readable field? Any non-printable characters would be given some default printable value but it would otherwise represent the same data.
This would allow a testdriver.js user to more easily compare the results of their query to their expectation. Instead of having to manually convert foo=bar to 0x660x6F0x6F=0x620x610x72, they could just look at the name_human_readable and value_human_readable (names TBD) fields instead.
If the expected cookie has non-printable characters in it, then you'd of course have to look at the byte string field instead. But considering that the vast majority of our cookie tests use printable characters, that seems fine.
I suppose you could also just add a helper function that the user could call to do the human readable conversion, but having it built into the JSON object seems appealing.
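To sketch what that might look like (field names are placeholders, not a concrete proposal for the spec wording):

```js
// Hypothetical serialized cookie carrying both the raw octets and a printable
// rendering for test authors. Field names are placeholders only.
const serializedCookie = {
  name: [0x66, 0x6f, 0x6f],     // raw octets of "foo"
  value: [0x62, 0x61, 0x72],    // raw octets of "bar"
  name_human_readable: "foo",   // printable form for easy comparison
  value_human_readable: "bar",  // non-printable octets would get a default placeholder
};
```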
From a spec point of view the main concern I have is compatibility; we can't change the existing fields from a string to e.g. an array, since that will break consumers. We probably can change it from a plain string to an encoded string as long as the encoding is a no-op for UTF-8 values. We could add an extra field, but we'd still need to specify how to serialize the existing field.
I'm confused by the example above. A ByteString (see Web IDL) is ASCII-compatible so "foo=bar" would be "foo=bar". In fact, all bytes get mapped to their code point of equal value. This mainly looks weird if those bytes also happen to be UTF-8-compatible byte sequences. As this is mainly for testing I'm not sure it's worth optimizing for that, but I suppose one could invent a format that mostly looks like normal Unicode strings but can also represent arbitrary bytes through some kind of escape hatch?
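To make the mapping concrete, a minimal sketch assuming the cookie value arrives as raw UTF-8 bytes (bytesToByteString is just an illustrative helper, not an existing API):

```js
// Map each raw byte to the code point of equal value, i.e. the Web IDL
// ByteString convention.
function bytesToByteString(bytes) {
  return Array.from(bytes, (b) => String.fromCharCode(b)).join("");
}

bytesToByteString(new TextEncoder().encode("foo=bar")); // "foo=bar" (ASCII unchanged)
bytesToByteString(new TextEncoder().encode("ü"));       // "Ã¼" (UTF-8 bytes 0xC3 0xBC)
```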
Seems like it was an uninformed take, thanks for your feedback!
As I understand it, the sticking point is that byte strings cannot be directly represented within JSON. I assume this is at least partially due to the fact that some bytes can map to non-printable characters, while JSON is meant to be human readable. (The Unicode chart for Basic Latin seems to imply that these should all have printable representations, but my attempts to copy-paste them here were not fruitful.)
Is there any reason not to convert to a JSON string, including escaping non-printable characters, directly? As you mentioned, Anne, this is primarily for testing, so for the majority of cases it should "just work", while users testing for control characters or the like would just need to expect escape sequences in the returned fields.
ByteString can only contain U+0000 through U+00FF (each representing a byte). While not all of those are "printable", I don't think JSON actually puts limits on those particular code points?
Oh, I see: the proposal is to use a string as the JSON-side encoding, but to have it hold the byte representation directly. If Gecko and Blink already allow UTF-8 strings, that feels like it could break existing users. It also seems pretty surprising; pretty much any modern consumer is going to convert a JSON string into the language's native text type by default, and it's very unusual for that to actually contain raw bytes.
I think I'd prefer that, in the case where we know we have valid UTF-8 bytes, we just encode them as a string directly in the value field. Then, in the case where the bytes can't be decoded as UTF-8, we also have a value_bytes field that's an array of integers. In that case I'm not sure what should happen with the value field (e.g. whether it should be some attempt at partially encoding the value as a string, or whether it should be empty, or null, or something else; in any case it seems like an edge case that an insufficiently careful client is unlikely to handle well).
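Roughly the two shapes that would produce, as a sketch (the null placeholder is just one of the options mentioned above, not a decision):

```js
// Value decodes cleanly as UTF-8: expose it directly as a string.
const utf8Cookie = {
  name: "foo",
  value: "bär",
};

// Value is not valid UTF-8: add an integer array with the raw octets.
// What `value` itself should hold here is the open question above.
const binaryCookie = {
  name: "foo",
  value: null,
  value_bytes: [0xff, 0xfe, 0x62, 0x61, 0x72],
};
```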
For ease of usage I'd probably always include both fields then. Seems a bit annoying for consumers to have to branch.
> While not all of those are "printable", I don't think JSON actually puts limits on those particular code points?
If I'm reading the JSON string definition correctly, it seems there are restrictions on ", \, and "control characters". I'm not clear on exactly which characters count as "control characters", but I'm expecting it's at least the branches shown in that page's figure, i.e. "backspace", "formfeed", etc.
So if a cookie value contains the octet 0x5C (\), that would need to be encoded as \\ in the final JSON representation.
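For example, letting a JSON serializer handle that escaping:

```js
const value = "a\\b";   // the three characters a, \, b
JSON.stringify(value);  // produces "a\\b" (quoted, with the backslash escaped) on the wire
JSON.parse(JSON.stringify(value)) === value; // true: the escaping round-trips
```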
@jgraham If I'm reading your comment correctly it seems we're both suggesting the same thing? I.e. a byte representation and a human readable version.
I see, according to https://www.ecma-international.org/publications-and-standards/standards/ecma-404/ certain code points need to use their escaped representation. However, that is for the wire representation; consumers of JSON wouldn't see the difference between that and the actual code point, so I'm still not entirely sure what your proposal/point is.
I'm assuming that, as a user, I'd call Get All Cookies and then work directly with the returned list of serialized cookies.
Since a serialized cookie is a JSON object I also assume that its fields' values have the same restrictions as JSON. If those assumptions don't hold then my suggestion doesn't make much sense.
But you get a "JSON object", right? Not a serialized JSON object. I.e., the result of JSON.parse() from what comes over the protocol. As such you don't have to deal with the pre-JSON.parse() particulars of serializing a string, I'd imagine.
If that's the case then I suppose it would just work if you serialize the byte string representing the cookie.
I'm quite confused. In pseudocode, the serialization would look something like:
```js
json.serialize({
  name: cookie.name,
  value: String.from_latin1(cookie.value.as_bytes())
  // Other fields
})
```
i.e. you'd get the byte representation of the value and convert it to a string in some encoding that just maps each byte 0x00-0xFF to the code point of equal value. So a cookie with value "ü" would be represented on the wire as the JSON string "Ã¼", assuming that the original byte string is the UTF-8 encoding of that code point. Is that what's being suggested? If so, that seems very user-unfriendly for the presumably-common case of valid UTF-8 (that isn't ASCII/Latin-1).
I'm not sure, but I somewhat doubt that is common as HTTP headers don't really contain UTF-8. It's document.cookie that does that.
FWIW, the webRequest.HttpHeaders WebExtensions API uses either value or binaryValue to represent headers that aren't UTF-8 encoded.
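For reference, that looks roughly like this (each header object carries either value or binaryValue):

```js
// Roughly the shape used by webRequest's HttpHeaders: each header has either a
// string `value` or an integer-array `binaryValue` with the raw octets.
const requestHeaders = [
  { name: "Accept", value: "text/html" },
  { name: "X-Binary", binaryValue: [0xde, 0xad, 0xbe, 0xef] },
];
```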