
Minor (in?)consistency in terms naming with Treatment scheme

Open m-dz opened this issue 4 years ago • 0 comments

We recently ran into a pretty silly problem with term naming when using the Treatment scheme, see below:

Imports and data prep.:

import numpy as np
from patsy import dmatrices, dmatrix, demo_data
data = demo_data("a", "b", "x1", "x2", "y", "z column")
  1. Single quotation marks snippet:
dmatrix("C(a, Treatment('a1')) + x1 + x2", data)
  2. Double quotation marks snippet:
dmatrix('C(a, Treatment("a1")) + x1 + x2', data)

Skipping the full printout, 1) gives the following terms' names:

  Terms:
    'Intercept' (column 0)
    "C(a, Treatment('a1'))" (column 1)
    'x1' (column 2)
    'x2' (column 3)

while 2):

  Terms:
    'Intercept' (column 0)
    'C(a, Treatment("a1"))' (column 1)
    'x1' (column 2)
    'x2' (column 3)

This inconsistency in the quotation marks used in the output caused some trouble when post-processing/cleaning term names. I understand that the output is consistent with the input, but it might be beneficial to standardise the output here (as in "C(a, Treatment('a1'))" -> 'C(a, Treatment("a1"))').
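
For anyone hitting the same thing in the meantime, here is a rough sketch of a post-processing workaround. It is not part of patsy: `normalize_term_name` is a hypothetical helper, written for Python 3 and assuming each term name is a single line of valid Python-like tokens. It re-emits every string literal via `repr()`, so both spellings collapse to the same name. Note that `repr()` prefers single quotes, i.e. the opposite direction from the one suggested above, but any fixed convention would do.

```python
import ast
import io
import tokenize


def normalize_term_name(name):
    """Hypothetical helper (not a patsy API): canonicalise the quoting of
    string literals inside a single-line term name."""
    out = []
    pos = 0  # index in `name` up to which text has already been copied
    for tok in tokenize.generate_tokens(io.StringIO(name).readline):
        if tok.type == tokenize.STRING:
            start, end = tok.start[1], tok.end[1]
            out.append(name[pos:start])  # text before the literal, unchanged
            out.append(repr(ast.literal_eval(tok.string)))  # canonical quoting
            pos = end
    out.append(name[pos:])  # whatever follows the last literal
    return "".join(out)


single = normalize_term_name("C(a, Treatment('a1'))")
double = normalize_term_name('C(a, Treatment("a1"))')
print(single == double)
```

Names without string literals, such as x1 or Intercept, pass through unchanged, so the helper can be mapped over the whole list of term names.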

This seems loosely related to e.g. https://github.com/pydata/patsy/issues/40 with its long categorical names. If the answer is similar, i.e. better not to fix things that aren't broken, maybe this can at least be mentioned in the docs? Happy to make a PR for that.

Edit: patsy 0.5.1 Python 2.7.5 (I know...)

m-dz, Aug 28 '19 11:08