how to deal with unicode?
Dealing with just Python 2 for now: I understand that patsy expects (byte) strings, but the data containers might not follow this design. So what's the recommended way of handling this? Should we be messing with the data keys under the hood, or should patsy? The only way I can think of to handle this (other than statsmodels doing it under the hood) is for patsy to accept unicode plus an encoding, so that the formula and the data keys can both be encoded consistently. E.g., this fails:
import numpy as np
import pandas as pd
import patsy

data = pd.DataFrame({
    u'àèéòù': np.random.randn(100),
    'x': np.random.randn(100)})
formula = u"Q('àèéòù') ~ x".encode('utf-8')
dmatrices = patsy.dmatrices(formula, data=data)  # fails
But if we also encode the data keys, it's fine. So should dmatrices and whatever other entry points also take an encoding? Am I missing something?
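To make the proposal concrete, here is a hypothetical sketch of what "take an encoding" could look like. The encoding argument does not exist in patsy; it is shown purely for illustration.
# Hypothetical API only -- patsy.dmatrices has no encoding argument today.
# The idea is that patsy would encode the unicode formula and look up the
# data keys consistently under the hood.
y, X = patsy.dmatrices(u"Q('àèéòù') ~ x", data=data, encoding='utf-8')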
The formula parser depends on the Python lexer, which works only on str objects. Does py2 even accept raw utf-8 embedded between quote marks in source code? A call to Q(...) is just vanilla Python code and subject to its usual constraints. At the very least, for this to work you should be writing Q(u'...')...
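As a rough illustration (this is not patsy's actual parsing code, and 'weight' is just a placeholder column name), the formula text ends up going through Python's own tokenize module, so it has to be something the Python 2 tokenizer accepts:
# Rough sketch: tokenizing a formula the way a formula parser built on the
# Python 2 tokenizer has to -- the input is a plain byte string (str).
import tokenize
from StringIO import StringIO  # Python 2

formula = "Q('weight') ~ x"  # a plain str on Python 2
for tok_type, tok_string, _, _, _ in tokenize.generate_tokens(StringIO(formula).readline):
    print tokenize.tok_name[tok_type], repr(tok_string)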
I'm not sure how much we can really do to fix this within py2. Are you sure you don't just want to tell people who depend on unicode to upgrade to py3? :-)
I'm just trying to clean up a PR that has been sitting around for a while, and it tries to support unicode. It also dawned on us that we have no tests for unicode formula input, so I imagine it won't quite work for non-ASCII characters.
I'll let you go through the permutations, but like I said, AFAICT, this is the only thing that "works." It'd be nice if patsy did it under the hood, so I don't have to decode things on the way back out to return unicode, but you know better than me.
import numpy as np
import pandas as pd
import patsy

data = pd.DataFrame({
    u'àèéòù'.encode('utf-8'): np.random.randn(100),
    'x': np.random.randn(100)})
formula = u"Q('àèéòù') ~ x".encode('utf-8')
dmatrices = patsy.dmatrices(formula, data=data)
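For completeness, the "decode things on the way back out" step mentioned above looks roughly like this (a sketch, assuming the utf-8 encoding used in the snippet):
# Sketch only: recover unicode term/column names from the encoded results.
y, X = dmatrices
y_names = [name.decode('utf-8') for name in y.design_info.column_names]
X_names = [name.decode('utf-8') for name in X.design_info.column_names]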
There's really no reliable way for patsy to somehow reach inside the 'data' object and replace unicode keys with str keys.
Two options that work now with the original DataFrame with unicode keys:

# Assumes source code is in utf-8
dmatrices("Q('àèéòù'.decode('utf-8')) ~ x", data=data)

# Works in general
dmatrices(u"Q(u'àèéòù') ~ x".encode("unicode-escape"), data=data)
Neither of these gives you nice term names, but that seems impossible AFAICT.
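To see why the second option "works in general": on Python 2, unicode-escape turns every non-ASCII character into a backslash escape, leaving a pure-ASCII str that the tokenizer accepts. A rough sketch:
# Sketch only: what unicode-escape does to the formula on Python 2.
raw = u"Q(u'àèéòù') ~ x"
escaped = raw.encode("unicode-escape")
# 'escaped' is now a plain ASCII str whose source text reads roughly
#   Q(u'\xe0\xe8\xe9\xf2\xf9') ~ x
# (with literal backslash-x escapes), so the embedded u'...' literal
# evaluates back to the original unicode column name.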
Some things that patsy could do:
- If it receives a unicode string on py2, automatically call .encode("unicode-escape") on it, so that if you're very careful to write your formulas exactly in the form u"Q(u'àèéòù') ~ x", then they'll work.
- If it tries to look up a variable name and finds that it doesn't exist, then try calling .decode("utf-8") on the variable name and try again.
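A minimal sketch of the second fallback, assuming a made-up lookup_in_data helper (this is not how patsy's code is actually organized):
# Sketch only -- lookup_in_data is a hypothetical stand-in for wherever a
# formula evaluator resolves a variable name against the data container.
def lookup_in_data(name, data):
    try:
        return data[name]
    except KeyError:
        # Fallback hack: the parsed name is a utf-8 byte string, but the
        # data keys might be unicode.
        return data[name.decode('utf-8')]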
I'm really reluctant to implement either of these because they're both really horrible hacks that don't really solve the problem at all. OTOH switching to py3 is a clean solution that just works...
Yeah, we tried both of those, and the latter (the .decode('utf-8') fallback) was my "solution", given that it's easier on users. Why is this not reliable?
I agree that it's very much a least-worst solution, and I understand if you don't want to implement it. I'm just not sure I see the harm in trying a fallback, except from a code-purity standpoint. We'll likely have to do this more systematically if we continue to get PRs from international users, which means the hack goes up a level and likely has to touch more code.
If you just put unicode characters into a string literal in py2, what even happens? Don't they end up encoded in the user's locale charset or something? I just don't understand enough about this to know if, why, or when decoding back as utf-8 would even work.
Yikes, I just had the same problem: I had a big list of column names, which were unicode, and constructed the formula like formula = "%s ~ %s" % (depended, " + ".join(independent)), which resulted in a unicode formula because one of the column names was unicode. That in turn resulted in PatsyError: model is missing required outcome variables :-(
If there is no proper solution, please make this more obvious, e.g. by warning in _do_highlevel_design if the formula or one of the column names is a unicode string.
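A minimal sketch of the kind of guard being asked for (illustrative only, not patsy's actual code or API):
# Illustrative only: fail early on Python 2 when a unicode formula sneaks
# in, instead of letting the variable lookup fail later with a confusing
# error.  A warnings.warn call would work just as well here.
import sys

def check_formula(formula_like):
    if sys.version_info[0] == 2 and isinstance(formula_like, unicode):
        raise TypeError("patsy formulas must be byte strings (str) on "
                        "Python 2; got a unicode string: %r" % (formula_like,))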
Has there been any progress on this? Looking back through the comments here, I don't see an explanation of why patsy requires bytestrings in the first place.
Patsy does at least provide a more sensible/detailed error message now: https://github.com/pydata/patsy/blob/master/patsy/highlevel.py#L49-L60
@BrenBarn: unfortunately, the bytestring requirement on py2 is baked into the language itself: patsy formulas contain Python code, and on Python 2, Python code is bytestrings (specifically, if you try passing unicode to the tokenize module, it errors out, and patsy relies on this module). Not much I can do about it :-(. There's a bit about this in the manual: https://patsy.readthedocs.org/en/latest/py2-versus-py3.html
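For reference, on Python 3 the example from the top of the thread works as-is, since formulas and data keys are both unicode str (a sketch with the same made-up data):
# Python 3: no encoding gymnastics needed.
import numpy as np
import pandas as pd
import patsy

data = pd.DataFrame({'àèéòù': np.random.randn(100),
                     'x': np.random.randn(100)})
y, X = patsy.dmatrices("Q('àèéòù') ~ x", data=data)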