Categorical names (again)

Open jseabold opened this issue 10 years ago • 4 comments

Do we really need, say, the reference level in the Treatment contrast? I'm not sure it adds enough information vs. the complexity it adds to the names to warrant inclusion. Thoughts? AFAICT, it only appears if you specify a reference level. If you specify one, then surely you know what you specified.

[~/]
[7]: dmatrix('~C(A, Treatment)', data=pd.DataFrame([['some really long name'], ['other name'], ['other name']], columns=['A']))
[7]: 
DesignMatrix with shape (3, 2)
Intercept  C(A, Treatment)[T.some really long name]
        1                                         1
        1                                         0
        1                                         0
Terms:
    'Intercept' (column 0)
    'C(A, Treatment)' (column 1)

[~/]
[8]: dmatrix("~C(A, Treatment('some really long name'))", data=pd.DataFrame([['some really long name'], ['other name'], ['other name']], columns=['A']))
[8]: 
DesignMatrix with shape (3, 2)
Intercept  C(A, Treatment('some really long name'))[T.other name]
        1                                                       0
        1                                                       1
        1                                                       1
Terms:
    'Intercept' (column 0)
    "C(A, Treatment('some really long name'))" (column 1)

jseabold, May 06 '14 14:05

The problem is that as far as patsy is concerned, "C(A, Treatment('some really long name'))" is an opaque blob of arbitrary Python code (which happens to return a special object that patsy knows how to interpret as a categorical column). So I'm pretty hesitant to get into the business of trying to parse code like this to try and guess which parts can be thrown away :-/

Reply to this email directly or view it on GitHub: https://github.com/pydata/patsy/issues/40

Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org

njsmith, May 06 '14 16:05

I understand the hesitancy to fix things that aren't really broken, but IMO this is a pretty bad usability issue that has come up before.

To be clear, I'm just talking about what's in design_info.column_names, I think. Do things rely on this later? Can't the builder retain the reference information without showing it to us all the time? I'm writing a lot of code just to get back sensible names and be able to manipulate DataFrames that have designs built from patsy. E.g., I either have to regex the names back to something sensible or type out things like this (actual use case):

X[["C(dialect_region, Treatment('East Central German'))[T.North German]",
 "C(dialect_region, Treatment('East Central German'))[T.West Central German]", 
 ...]]

Typing this kind of stuff out is brutal. Even just being able to leave out the reference would be an improvement IMO.
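For reference, the "regex this back to something sensible" workaround could look roughly like the sketch below. It is not part of patsy; the helper name shorten_column_name is hypothetical, and the pattern assumes the variable is a plain identifier, the contrast is Treatment with an explicit reference level, and the reference level contains no parentheses.

```python
import re

# Hypothetical helper, not a patsy API. Matches "C(var, Treatment(<ref>))"
# where <ref> contains no parentheses, and keeps only the variable name.
_TREATMENT_CALL = re.compile(r"C\(\s*(\w+)\s*,\s*Treatment\([^()]*\)\s*\)")


def shorten_column_name(name):
    """Drop the C(..., Treatment(...)) wrapper, keeping the [T.level] suffix."""
    return _TREATMENT_CALL.sub(r"\1", name)


long_name = "C(dialect_region, Treatment('East Central German'))[T.North German]"
print(shorten_column_name(long_name))
```

Names that do not match the pattern, such as Intercept or C(A, Treatment)[T.some really long name], pass through unchanged. Since pandas' DataFrame.rename accepts a callable for columns, something like X.rename(columns=shorten_column_name) would apply this to a whole design frame.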

jseabold, May 06 '14 16:05

Part of the solution on my end is to use shorter variable names, but that doesn't get around having to type the reference category each time I want to use the name.

jseabold, May 06 '14 16:05

I totally agree about the usability issue -- I'm not being hesitant to fix things that aren't broken, I'm being hesitant to start writing code to accomplish something that I'm not sure is even possible in principle :-/. Patsy doesn't know what C is, it's just an arbitrary Python function call. In fact Patsy doesn't even know Python syntax, so it doesn't even know there's a function call there...

Of course the best solution would be to have a proper way to represent categorical data (like R's factors) so that dialect_region could know its own reference level and preferred coding scheme and suchlike. In the meantime...

Some possible approaches:

  • Add some sort of generic string-mangling for long names. R has something like this -- in some cases (can't track it down right now), it starts throwing away spaces and vowels, I think. So you end up with stuff like C(dlctrgnTrtment(EstCntrlGrmn)). I... guess that doesn't really help much. But we could come up with a better one that, say, truncates to the unique prefix or something?
  • Add a name= argument to C, which sets an attr on the returned categorical object telling patsy what display name to use? Sort of awkward to write C(dialect_region, Treatment('East Central German'), name='dialect_region'), but at least it would work.
  • ...?
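As a rough illustration of the "truncate to the unique prefix" idea from the first bullet, here is a minimal sketch in plain Python. Nothing here exists in patsy: the function unique_prefixes and its min_len parameter are made up for illustration, and a real implementation would need to decide what to do with the rest of the column name.

```python
def unique_prefixes(names, min_len=3):
    """Shorten each name to its shortest prefix that no other name shares.

    Hypothetical helper, not a patsy API. A name that is itself a prefix of
    another name (or is duplicated) is left at full length.
    """
    result = []
    for name in names:
        others = [n for n in names if n != name]
        short = name  # fall back to the full name if no unique prefix exists
        for k in range(min_len, len(name) + 1):
            prefix = name[:k]
            if not any(other.startswith(prefix) for other in others):
                short = prefix
                break
        result.append(short)
    return result


levels = ["East Central German", "North German", "West Central German"]
print(unique_prefixes(levels))
```

The min_len floor keeps the result from collapsing to single letters, which would be unique but not very readable.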

(BTW as a stupid workaround you can avoid typing the reference category by renaming it so it's alphabetically first.)

njsmith, May 10 '14 22:05