py-earth
py-earth copied to clipboard
Forward pass adds same linear basis function
I'm trying to run a pyearth model with enable_pruning = False and only linear features with a max_degree of 1. I noticed that the same feature is added twice. See results below:
-------------------------------------------------------------------------
iter parent var knot mse terms gcv rsq grsq
-------------------------------------------------------------------------
0 - - - 124390.433112 1 125669.504 0.000 0.000
1 0 11 -1 58269.498543 2 60407.652 0.532 0.519
2 0 12 -1 46309.037751 3 49280.000 0.628 0.608
3 0 9 -1 43489.795470 4 47522.247 0.650 0.622
4 0 10 -1 40593.709662 5 45564.586 0.674 0.637
5 0 8 -1 38274.623612 6 44146.607 0.692 0.649
6 0 1 -1 36957.981690 7 43820.303 0.703 0.651
7 0 7 -1 34190.204492 8 41688.582 0.725 0.668
8 0 12 -1 33603.237796 9 42151.901 0.730 0.665
-------------------------------------------------------------------------
I noticed issue #135 which seems to suggest the same bug, addressed by @jcrudy .
I'm installing from the latest commit git+https://github.com/scikit-learn-contrib/py-earth.git@b209d1916f051dbea5b142af25425df2de469c5a#egg=sklearn-contrib-py-earth
@CherylCB Thanks for reporting this. I will look into it as soon as I can, which might be a while. It's annoying, but if this bug is causing problems for you the best workaround might be to add your own code to prune any duplicate basis functions. If you want to do that and need guidance on where to start, please comment here and I can elaborate.
It seems you're using py-earth essentially as a variable selection mechanism for linear regression. Is that right? If so, you could also just extract the set of selected variables and use them with sklearn.LinearRegression.
Thanks for your answer @jcrudy. For now indeed I have added my own code to workaround the problem of adding duplicate basis functions.
@CherylCB Glad you were able to work around this bug. Please feel free to post code for your workaround in this thread if it's shareable. It might help someone out later. I'll be leaving this issue open until it's fixed.
@jcrudy I have some time coming days to work on this bug, do you have any suggestions on what would be the first place for me to look?
@CherylCB That's great. I'll take all the help I can get. Going off of the current master, I'd start by looking here. As you'll see, I wrote some special code to try to prevent exactly what you are seeing. One of two possible things is probably happening:
-
The
has_linearmethod in that line is not correct. Thehas_linearmethod is defined as part of theBasisFunctionabstract class, which you'll find in _basis.pyx. It callslinear_in, which has different implementations for the different subclasses. Possibly something in this logic is wrong. -
The forward pass is not using
variable_can_be_linearcorrectly in some specific situations. If this is what's happening, you'll have to go over the logic of the forward pass and see where the linear basis function is being added in your example, then figure out how to prevent it from happening. If you find any clues here they could be very helpful for me, even if you're not sure how to fully solve the problem.
You're actually in a good position to figure this out, since you have an example data set that shows the problem. I'd suggest you try to debug what's happening in the forward pass using a script that fits a model to your data set. Unfortunately, it's hard to set up a debugger to work with cython. Perhaps you're more skilled than me in this area, but if not I suggest you just use print statements in the cython files.
The workflow is something like this:
- Clone the py-earth repo
- cd into the repo and build the code using: python setup.py build_ext --inplace --cythonize
- cd back out of the repo. Put your test script in the same directory as the repo (not inside the repo)
- Make sure your python environment doesn't have py-earth installed.
- Now when you run your script, it should import pyearth from the repo.
- Whenever you change a cython file (such as to add print statements, for example), you'll need to run the command from step 2 again.
If you have any problems, don't hesitate to get in touch. You can reply here or email me (my address is on my github profile). You're potentially saving me a lot of time by working on this, so of course I'm very happy to spend some time helping you succeed at it. Good luck!