Maximum recursion depth error for formulas with more than 485 terms

Open szs8 opened this issue 11 years ago • 14 comments

I am working with a dataframe which has 7000 columns and it turns out that once you go beyond 485 terms, patsy throws a recursion error when going from a formula to a design matrix. Is there a better way of doing this?

Thanks!

In [282]: df = pd.DataFrame(dict(('a' + str(i), np.random.randn(5)) for i in xrange(500)))

In [283]: formula = " + ".join(df.columns)

In [284]: dmatrices(formula, df)

....

/Users/xxx/lib/python2.7/site-packages/patsy-0.1.0_dev-py2.7.egg/patsy/desc.pyc in eval(self, tree, require_evalexpr)
    452                                 "'%s' operator" % (tree.type,),
    453                                 tree.token)
--> 454         result = self._evaluators[key](self, tree)
    455         if require_evalexpr and not isinstance(result, IntermediateExpr):
    456             if isinstance(result, ModelDesc):

/Users/xxx/lib/python2.7/site-packages/patsy-0.1.0_dev-py2.7.egg/patsy/desc.pyc in _eval_binary_plus(evaluator, tree)
    283
    284 def _eval_binary_plus(evaluator, tree):
--> 285     left_expr = evaluator.eval(tree.args[0])
    286     if tree.args[1].type == "ZERO":
    287         return IntermediateExpr(False, None, True, left_expr.terms)

/Users/xxx/lib/python2.7/site-packages/patsy-0.1.0_dev-py2.7.egg/patsy/desc.pyc in eval(self, tree, require_evalexpr)
    452                                 "'%s' operator" % (tree.type,),
    453                                 tree.token)
--> 454         result = self._evaluators[key](self, tree)
    455         if require_evalexpr and not isinstance(result, IntermediateExpr):
    456             if isinstance(result, ModelDesc):

/Users/xxx/lib/python2.7/site-packages/patsy-0.1.0_dev-py2.7.egg/patsy/desc.pyc in _eval_binary_plus(evaluator, tree)
    283
    284 def _eval_binary_plus(evaluator, tree):
--> 285     left_expr = evaluator.eval(tree.args[0])
    286     if tree.args[1].type == "ZERO":
    287         return IntermediateExpr(False, None, True, left_expr.terms)

/Users/xxx/lib/python2.7/site-packages/patsy-0.1.0_dev-py2.7.egg/patsy/desc.pyc in eval(self, tree, require_evalexpr)
    448         assert isinstance(tree, ParseNode)
    449         key = (tree.type, len(tree.args))
--> 450         if key not in self._evaluators:
    451             raise PatsyError("I don't know how to evaluate this "
    452                                 "'%s' operator" % (tree.type,),

RuntimeError: maximum recursion depth exceeded in cmp

szs8 avatar Apr 12 '13 13:04 szs8

I guess I can use ModelDesc etc. https://patsy.readthedocs.org/en/latest/expert-model-specification.html

But in any case it might make sense to fail gracefully here.

szs8 avatar Apr 12 '13 14:04 szs8

Huh, fair enough, the parse evaluator does recurse over the parse tree. It hadn't occurred to me that people would want to parse strings with hundreds of terms :-).
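To illustrate why the term count matters, here is a minimal sketch in plain Python (not patsy's actual code). It assumes a left-associative parser, so "a0 + a1 + a2" becomes ((a0 + a1) + a2), and an evaluator that descends into the left operand first; `parse_sum` and `eval_recursive` are hypothetical names. The traceback above shows two frames per `+` (`eval` and `_eval_binary_plus`), which would be consistent with hitting CPython's usual default limit of 1000 somewhere near 485 terms, though that is an inference rather than something confirmed in this thread.

```python
import sys


def parse_sum(names):
    """Build the left-nested tree a left-associative parser yields for 'n0 + n1 + ...'."""
    tree = names[0]
    for name in names[1:]:
        tree = ("+", tree, name)  # ((n0 + n1) + n2) + ...
    return tree


def eval_recursive(tree):
    """Collect term names, descending into the left operand first."""
    if isinstance(tree, str):
        return [tree]
    _, left, right = tree
    return eval_recursive(left) + eval_recursive(right)


small = parse_sum(["a%d" % i for i in range(5)])
print(eval_recursive(small))  # ['a0', 'a1', 'a2', 'a3', 'a4']

old_limit = sys.getrecursionlimit()
sys.setrecursionlimit(1000)  # CPython's usual default, set explicitly for the demo
try:
    # 2000 terms means a tree about 2000 levels deep, one frame per level here
    big = parse_sum(["a%d" % i for i in range(2000)])
    try:
        eval_recursive(big)
        overflowed = False
    except RecursionError:
        overflowed = True
finally:
    sys.setrecursionlimit(old_limit)
print(overflowed)  # True
```

Raising the limit with `sys.setrecursionlimit` would postpone the error, but the tree depth still grows linearly with the number of terms, so it is a stopgap rather than a fix.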

I'll think about how fixable that is. In the meantime you may prefer in any case to use the programmatic interface for constructing formulas, which bypasses the string parser entirely. See http://patsy.readthedocs.org/en/latest/expert-model-specification.html and in particular the paragraph starting "However, there is also a middle ground...".

In your case I'd do something like

from patsy import ModelDesc, Term, LookupFactor, dmatrix

# Term takes a list of factors, so each column gets a one-element list
my_formula = ModelDesc([], [Term([LookupFactor(c)]) for c in df.columns])
dmatrix(my_formula, df)

Let me know how it goes; there might be other places where I didn't think the scaling through far enough...

njsmith avatar Apr 12 '13 14:04 njsmith

Thanks, I was just about to do that and was trying to figure out how categorical columns would be handled especially if they are integers.

I guess I should never have attempted to build such a huge formula anyway, but sometimes you are pig-headed and just want to plough forward!

szs8 avatar Apr 12 '13 14:04 szs8

@signalseeker, I recently ran into the same error using statsmodels to build a logistic regression with more than 485 predictors. The data I'm working with has a very large predictor space and, unfortunately, there is nothing to be done about it. Thanks for looking into this, @njsmith.

jm-contreras-zz avatar May 14 '14 13:05 jm-contreras-zz

+1. Trying to run an interaction model '(a1+ a2+ ... a360) * (b1+...b40)' works, but '(a1+ a2+ ... a500) * (b1+...b40)' breaks :-(

Have to resort to sklearn.preprocessing.PolynomialFeatures

DSLituiev avatar May 20 '16 01:05 DSLituiev

@DSLituiev: so as noted upthread, you can use something like (untested)

from patsy import ModelDesc, Term, LookupFactor, dmatrix

terms = []
for i in range(1, 501):
    for j in range(1, 41):
        # Add an interaction between a{i} and b{j}, like a10:b12
        terms.append(Term((LookupFactor("a" + str(i)), LookupFactor("b" + str(j)))))
# An empty left-hand side plus the list of right-hand-side terms
preparsed_formula = ModelDesc([], terms)
dmatrix(preparsed_formula, dataframe)

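This gives you the same interaction terms as the patsy formulas you wrote above (the * in your formula also adds the main effects and an intercept, so append those terms to the list if you need them); it's just that instead of having to generate a big string and then have patsy parse it, you can go directly to patsy's high-level representation of your data structures.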

(And if you want to transform individual items before passing them in, you can replace LookupFactor(...) with something like EvalFactor("np.log(x)") or EvalFactor("C(a10)"), or you can even define a custom factor class -- mostly you just need to implement an eval method that takes a dataframe and returns your factor's values.)
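One caveat worth checking when building terms by hand: a formula string such as "(a1 + a2) * (b1 + b2)" expands to the main effects plus the pairwise interactions, and patsy adds an intercept by default. The sketch below builds that full term list for a small case and compares it with what the parser produces. It uses `EvalFactor` rather than `LookupFactor` only so the terms compare equal to the parsed ones; the column names are made up for illustration, and the comparison assumes a reasonably recent patsy release.

```python
from patsy import INTERCEPT, EvalFactor, ModelDesc, Term

a_names = ["a1", "a2", "a3"]  # hypothetical column names
b_names = ["b1", "b2"]

# Intercept, then main effects, then every a:b interaction
terms = [INTERCEPT]
terms += [Term([EvalFactor(n)]) for n in a_names + b_names]
terms += [Term([EvalFactor(a), EvalFactor(b)]) for a in a_names for b in b_names]
built = ModelDesc([], terms)

# The same model, obtained through the string parser for comparison
parsed = ModelDesc.from_formula("(a1 + a2 + a3) * (b1 + b2)")
same_terms = set(parsed.rhs_termlist) == set(built.rhs_termlist)
print(same_terms)
```

For a column that holds integer codes but should be treated as categorical, the same pattern should work with `EvalFactor("C(a1)")` in place of `EvalFactor("a1")`.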

njsmith avatar May 20 '16 02:05 njsmith

That said, I'm not likely to find the time to fix this soon, but it certainly is fixable by replacing the current recursive loop with an equivalent non-recursive loop, and I'd be happy to accept a patch if anyone wants to make one.
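As a rough idea of what such a patch could look like, here is a sketch in plain Python of walking a left-nested `+` tree with an explicit stack instead of recursion. It is not patsy's evaluator: the tuple-based tree and the names `parse_sum` and `eval_iterative` are assumptions for illustration, and a real patch would also have to handle the other operators.

```python
def parse_sum(names):
    """Build the left-nested tree a left-associative parser yields for 'n0 + n1 + ...'."""
    tree = names[0]
    for name in names[1:]:
        tree = ("+", tree, name)
    return tree


def eval_iterative(tree):
    """Collect term names in order using an explicit stack, so depth is not limited."""
    terms = []
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, str):
            terms.append(node)
        else:
            _, left, right = node
            # Push right first so the left operand is popped and handled first
            stack.append(right)
            stack.append(left)
    return terms


names = ["a%d" % i for i in range(7000)]
result = eval_iterative(parse_sum(names))
print(len(result), result[:3])
```

The explicit stack lives on the heap, so the 7000-column case from the original report stays well clear of the interpreter's recursion limit.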

njsmith avatar May 20 '16 02:05 njsmith

Thank you! This looks like enough for my application, and I'm afraid I'm not sufficiently equipped to tinker with the source for now.

DSLituiev avatar May 20 '16 05:05 DSLituiev

Was this ever fixed?

jolespin avatar Aug 03 '18 00:08 jolespin

@jolespin I don't think so.

njsmith avatar Aug 03 '18 02:08 njsmith

I tried doing a mixed effects model with 4000 attributes and it kind of just got stuck and my computer stopped sounding like it was computing anything. Is there a maximum number of attributes that can go in a linear model?

jolespin avatar Aug 03 '18 02:08 jolespin

@jolespin This issue is about formulas with lots of terms, like "y ~ x1 + x2 + x3 + x4 + x5 + x6 + ........ + x3999 + x4000", and it causes crashes, not freezes. Your issue sounds like something you should report to the package you're using to do mixed effect models (maybe statsmodels?)

njsmith avatar Aug 03 '18 02:08 njsmith

I got the same issue. Any update on this?

Hoeze avatar Oct 03 '22 15:10 Hoeze

At this point, we are unlikely to fix this in patsy. The issue is resolved in Formulaic, however.

matthewwardrop avatar Oct 09 '22 18:10 matthewwardrop