Zope icon indicating copy to clipboard operation
Zope copied to clipboard

"HTTPRequest.processInputs": broken "character_encoding" handling

Open d-maurer opened this issue 5 years ago • 10 comments

from ZPublisher.HTTPRequest import HTTPRequest
try: from urllib.parse import quote
except ImportError: from urllib import quote

def epi(qs):
  r = HTTPRequest(
    None,
    dict(
      SERVER_NAME="nohost",
      SERVER_PORT="80",
      SCRIPT_NAME="",
      REQUEST_METHOD="GET",
      QUERY_STRING=qs),
    None,
    )
  r.processInputs()
  return r.form

bstring = u'\x80\x81\x82' # 'latin1' decoded sequence of bytes
epi("bstring:latin1:bytes=" + quote(bstring.encode("latin1")))

produces three different results in Zope2, Py3/Zope4 and Py2/Zope4:

Zope2: {'bstring': '\x80\x81\x82'} (the expected result)

Py3/Zope4: {'bstring': b'\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd'}

Py2/Zope4: {'bstring': '\xc2\x80\xc2\x81\xc2\x82'}

Zope 4 assumes that the query string content is encoded via the configuration dependent zpublisher_default_encoding. This is acceptable (only) when the request parameter does not explicitly specify its own character encoding.

dm.zopepatches.ztutils explicitly uses :utf8 to reliably pass around unicode values independent from a potentially different zpublisher_default_encoding.

d-maurer avatar May 18 '19 09:05 d-maurer

After eyeballing the code in both Zope 4 and Zope 2 the statement "Zope 4 assumes that the query string content is encoded via the configuration dependent zpublisher_default_encoding" appears to be true in Zope 2 as well. It's just that the default encoding value has changed.

Using your specific example will always work in Zope 2, as your latin-1 encoded sample string matches the Zope 2 default latin-15 well enough.

Looking at results where the sample string encoding matches neither the Zope 2 nor the Zope 4 defaults may be more enlightening.

dataflake avatar May 18 '19 19:05 dataflake

I have stepped through the code under Python 2 and it seems to do everything correctly. The input gets decoded using the character set you're explicitly passing in, and then re-encoded using the default ZPublisher encoding. The result on Zope 4 is a correctly encoded UTF-8 binary string. Do you expect something different?

Under Python 3 the field value is already messed up when it reaches the code around line 663 that deals with fields that have an explicit character encoding.

So as far as I can tell Zope 4 under Python 3 is indeed broken, but Zope 4 under Python 2 is not. Would you agree? I'm trying to gauge the scope of the issue and what results you would expect.

dataflake avatar May 18 '19 19:05 dataflake

Jens Vagelpohl wrote at 2019-5-18 12:35 -0700:

... I have stepped through the code under Python 2 and it seems to do everything correctly. The input gets decoded using the character set you're explicitly passing in, and then re-encoded using the default ZPublisher encoding. The result on Zope 4 is a correctly encoded UTF-8 binary string. Do you expect something different?

Sure: I expect to get the byte sequence back I have fed in -- not its utf-8 encoding.

... So as far as I can tell Zope 4 under Python 3 is indeed broken, but Zope 4 under Python 2 is not. Would you agree?

No.

I agree with you that Zope2 might have the same principal problem as Zope4+Py2 and that it is only hidden by a different "default_encoding". I am not sure however: I am actively using :utf8:ustring (via dm.zopepatches.ztutils) to reliably pass on unicode values in Plone environments - and I think, Plone uses by default utf-8 as "default_encoding".

But I interpret the "character_encoding" suffix of HTTPRequest.processInputs as the declaration that the affected value is encoded via this explicitly given encoding and not the "default_encoding". Under this interpretation, :latin1:bytes must return the value exactly (seen as a seqeunce of bytes) - not some encoding thereof.

Of course, there is no garantee that my interpretation is correct. Likely, there is no documentation/specification about how the ":"s of ZPublisher.HTTPRequest should work.

I'm trying to gauge the scope of the issue and what results you would expect.

My primary motivation to have looked into this is ZTUtils (respectively, dm.zopepatches.ztutils). In my view, it should support the construction of query string fragments and hidden form controls reproducing an equivalent value in the target.

E.g. make_query(x=v) should reliably give at the destination a request parameter x with value v -- for as much types of v as possible; if possible independent of a specific "default_encoding". If "v" happens to be of type "bytes", then the most natural result of make_query(x=v) is 'x:latin1:bytes=' + quote(v).

This seemed possible for most elementary types, if I can interpret the "character_encoding" suffix as the assertion that the corresponding value uses the specified encoding.

Meanwhile, I have recognized that this alone is not enough: With Python 3, identifiers can contain non ascii characters, implying that in the result of make_query(x=v) not only v but also x might be encoded with a character encoding. I am currently thinking about a new family of controls :key-<charset> or a single new control :key-encoded (fixing the encoding to utf-8), to specify that and how the "key" of a request parameter is encoded.

I also noticed another difficulty: request parameters can have two sources: they can be browser generated following the reaction to a former response (e.g. submitting a form back to the form generating server); or they can be "free standing" (without any direct relation to a former server response), e.g. come from documentation. In the first case, the browser will use the "charset" specified in the response it responds to; in the second case, it is unclear what "charset" was employed for the url construction -- maybe, it has followed the uri specification which says "charset should be utf-8".

As far as I see, HTTPRequest.processInputs cannot distinguish the 2 cases. It may well be that my aim to make make_query independent from the "default_encoding" is unachievable.

d-maurer avatar May 19 '19 09:05 d-maurer

I have a different interpretation of what processInputs is supposed to do. It is supposed to marshal input values into something that later processing code can reliably work with.

When you specify a form value character encoding you're helping the process by making sure the input value can be decoded reliably for processing. But keeping a byte string in a character set that is not the configured default is not helpful because the information what that encoding is will be lost after processInputs. Any code that attempts to decode it will run into issues. The only way to allow predictable processing further down the line is to use a common encoding, as represented by the configured default encoding.

So I stand by my earlier diagnosis: It's working correctly in Python 2. Python 3 is broken right now.

dataflake avatar May 19 '19 15:05 dataflake

Jens Vagelpohl wrote at 2019-5-18 12:35 -0700:

I have stepped through the code under Python 2 and it seems to do everything correctly. The input gets decoded using the character set you're explicitly passing in, and then re-encoded using the default ZPublisher encoding. The result on Zope 4 is a correctly encoded UTF-8 binary string. Do you expect something different?

I have looked at the Zope2 code:

                            if character_encoding:
                                # We have a string with a specified character
                                # encoding.  This gets passed to the converter
                                # either as unicode, if it can handle it, or
                                # crunched back down to latin-1 if it can not.
                                item = unicode(item,character_encoding)
                                if hasattr(converter,'convert_unicode'):
                                    item = converter.convert_unicode(item)
                                else:
                                    item = converter(
                                        item.encode(default_encoding))
                            else:
                                item = converter(item)

The comment states that the "specified character encoding" is the character encoding for the respective item. This seems to be in line with my interpretation: the passed in value is encoded using this "specified character encoding" (and not the zpublisher_default_encoding).

Converters with a "convert_unicode" (those are for Zope 2 the "u*" converters) correctly use the value decoded with this character encoding. That is the reason why my :utf-8:ustring works correctly in Plone (with its zpublisher-default-encoding = utf8).

The comment states further that for other converters, the (correctly) decoded value is "latin-1" encoded. This would allow that bytes input (as in my example) is processed correctly. However, the following code has replaced the "latin-1" from the comment by "default_encoding".

Now to my concrete example in Zope2: Zope2 (unlike "Zope4") does not define a ":bytes" converter. An unknown ":" is silently ignored. If there is no converter, a (recognized) character encoding is ignored as well. Thus, in my example, the value is used as it is -- which accidentally is the correct use. Thus, by accident, my example works in Zope 2 -- independent of the defined zpublisher-default-encoding.,

Let's now speak about Zope 4. Of course, its new ":bytes" converter should be able to accept a bytes string. The most natural way to represent the bytes string bstr as text (i.e. unicode for Python 2, str for Python 3) is bstr.decode('latin1').

I conclude from that:

  1. the new ":bytes" converter should support "text" as input (to be Python2/Python3 compatible). For Zope4+Py2, this would mean that it gets a convert_unicode method.

  2. if the converter gets a "text" (rather than bytes) input, it should interpret it as a latin-1 decoded bytes string because that is the most natural representation of a bytes string as text) and raise an exception if this interpretation fails

This would solve the passing of byte strings around via (free standing) urls. It would not answer the question how to pass byte strings across browser interactions (by the intermediary of a hidden form field) because the zpublisher-default-encoding in effect may not allow the encoding of the the latin-1 decoded byte string. Nor would it support ":bytes" form fields.

I have checked how zope.schema handles this (admittedly) difficult case. Its approach is drastic: the value of a Bytes field is effectively restricted to ASCII.

d-maurer avatar May 20 '19 07:05 d-maurer

Quick update after some more debugging: Under Python 3 the latin1-encoded values are broken during the creation of the FieldStorage object:

https://github.com/zopefoundation/Zope/blob/57da264fbcf1c0577470231208683b9d1b38de60/src/ZPublisher/HTTPRequest.py#L514-L515

FieldStorage knows nothing about the special marshaling identifier :latin1 and assumes every value is encoded or encodable in the encoding passed for the request as a whole. So the returned FieldStorage contains broken values for anything that's encoded differently than the default (UTF-8).

So it appears using encoding identifiers other than the default UTF-8 is badly broken under Python 3, and fixing it is not trivial. If feels like we either need to override a whole bunch of the underlying cgi.FieldStorage to add intelligence about those encoding flags, or pre-process all fields ourselves and re-encoding all values that don't conform to the default encoding.

dataflake avatar May 25 '19 17:05 dataflake

Jens Vagelpohl wrote at 2019-5-25 10:50 -0700:

... FieldStorage knows nothing about the special marshaling identifier :latin1 and assumes every value is encoded or encodable in the encoding passed for the request as a whole. So the returned FieldStorage contains broken values for anything that's encoded differently than the default (UTF-8).

The WSGI spec shows the route to follow: when constructing the "FieldStorage", do not use "default_encoding" but "latin-1". The use of "latin-1" does not mean that the data is "latin-1" encoded text - it is a purely technical way to map the sequence of bytes (the type of an url part after url-decoding) to an str (by mapping byte "b" to the unicode character with "b" as code point). Recode later, once you know with correct encoding to apply.

This way, you get the equivalent result for Python 2 (where the url part is delivered as a sequence of bytes) and Python 3 (where you get an str).

So it appears using encoding identifiers other than the default UTF-8 is badly broken under Python 3,

"utf-8" will help only, if it is used as zpublisher-default-encoding).

Currently, Zope 4 under Python 3 can only handle zpublisher-default-encoding reliably.

and fixing it is not trivial.

It is not difficult -- as outlined above.

I plan to present soon a ZEP (= "Zope Enhancement Proposal") for a modernized request parameter handling. It should (among others) contain answers for the encoding questions.

d-maurer avatar May 26 '19 08:05 d-maurer

Dieter, I have seen you mention several times the use of latin-1 as a "safe" way to decode bytes strings that may be in different encodings. I had never heard about this before. Is this practice documented somewhere?

dataflake avatar May 26 '19 15:05 dataflake

Jens Vagelpohl wrote at 2019-5-26 08:30 -0700:

Dieter, I have seen you mention several times the use of latin-1 as a "safe" way to decode bytes strings that may be in different encodings. I had never heard about this before. Is this practice documented somewhere?

In the WSGI spec (as mentioned).

It is also straight forward to see: a byte is an integer between 0 and 255; the "latin-1" charset consists of the code points 0 to 255 of the unicode character set. This means that there is a natural one to one correspondence between bytes and the "latin-1" character set.

An url is after url decoding a sequence of bytes; decoding it further using the "latin-1" encoding gives (under Python 3) an str directly representing this byte seqeunce. This allows to use other information (either an explicitly specified encoding or some default encoding) to recode with the (hopefully) real encoding.

When you think about it, you will see that this resembles what happens with Python 2: there, you get this same byte sequence as a native string. Using the "latin-1" charset gives you the same in Python 3.

d-maurer avatar May 26 '19 16:05 d-maurer

Jens Vagelpohl wrote at 2019-5-26 08:30 -0700:

Dieter, I have seen you mention several times the use of latin-1 as a "safe" way to decode bytes strings that may be in different encodings. I had never heard about this before. Is this practice documented somewhere?

"https://infra.spec.whatwg.org/#isomorphic-decode" names this "isomorphic {en|de}coding".

It is used, e.g., in the URL standard ("https://url.spec.whatwg.org/").

d-maurer avatar May 27 '19 08:05 d-maurer

This is fixed in Zope 5.9.x.

d-maurer avatar Feb 27 '24 08:02 d-maurer