pook pook can't handle binary headers

pook can't handle binary headers

Open jaydev opened this issue 6 years ago • 2 comments

If a response header is binary, for example something like ('Content-Type', b'application/pdf'), when pook tries to iterate over the headers, it is expecting a string and throws an error. The problem is in HTTPHeaderDict.itermerged.

Mar 07 '18 19:03 jaydev

This also applies to binary content. For example, if you pass in a bytestring as content, it'll choke at https://github.com/h2non/pook/blob/master/pook/response.py#L158-L159. Assuming all byte data is utf-8 by default is an odd decision.

Jul 18 '18 13:07 daredevil82

Yeah, are you interested in providing a fix? If yes, please, provide some test coverage as well.

Jul 26 '18 11:07 h2non

@jaydev @daredevil82 I'm trying to work on a fix for this issue, but am stuck on a decision that I'm not sure how to handle. I also want to mention that binary bodies are a separate issue that is already fixed.

Never mind, I found what I think is a much simpler and far more acceptable way to cleanly handle non-ASCII binary data using Python's unicode_escape encoding. What I wrote here isn't relevant anymore, but I want to leave it just in case. I've opened #90 with the fix and would appreciate input from any of the folks on this issue, if y'all are willing.

The crux of my difficulty with this is that HTTP headers must be ASCII, so even if receiving bytes, they need to be ASCII decodable. However, applications don't always work this way, so I want to make sure that any changes do not break fundamental API expectations or decrease flexibility that would break important edge cases that I'm not considering. Take the following example:

mock = pook.get('https://example.com')
mock.headers({'Content-Type': 'application/pdf'.encode('ascii')})

request = pook.Request()
request.url = 'https://example.com'
request.headers = {'Content-Type': 'application/pdf'}

Should the above request match the mock? My inclination is to say yes, it should, primarily because the specification notes that header field values must be USASCII: https://www.rfc-editor.org/rfc/rfc7230#section-3.2.4

On the other hand, the specification also says:

A recipient SHOULD treat other octets in field content (obs-text) as opaque data.

With that in mind, is it appropriate to attempt ASCII decoding of bytes to compare against a string value for a string header value matcher? One issue with this is that HTTPHeadersDict creates a string representation of a multi-value header, following the spec and separating each value with a comma. That's where byte header values currently fail, because str::join requires Iterable[str] and fails if bytes are present in the iterable.

The ability to combine header values for matching seems important, but I'm struggling to figure out what the "safe" assumption is. ', '.join([b'']) simply fails, but if .decode('ascii') also fails, what's the best approach? I considered potentially just running the value through b64encode instead, but if the user isn't expecting that, the match error will be very confusing without explaining that the header field data needed to be b64encoded.

Anyway, what do y'all think? Any ideas for how best to approach this? Attempting an ASCII decoding for each header value and always comparing string-to-string, and in particular continuing to raise an exception if ASCII decoding fails, seems like a fine stop-gap solution to me to at least support the most common case, where a header value was cast to bytes for one reason or another. Do you agree? Or is there an important edge case I'm not considering? Thanks in advance for any input and ideas y'all have.

Oct 27 '23 04:10 sarayourfriend

pook pook copied to clipboard

pook can't handle binary headers

pook
pook copied to clipboard