py-earth
Robust Regression
Hi
Would it be possible to add a robust regression option to replace the squared error term? This would allow the model to be less sensitive to outliers.
https://en.wikipedia.org/wiki/Huber_loss
Cheers
I like this idea. I've thought a lot about ways to use MARS with different loss functions. The conclusion I've come to is that gradient boosting is the way to do it, similarly to scikit-learn's GradientBoostingRegressor but with MARS basis functions instead of trees. It's on my road map, but it's a lot of work to implement, so it probably won't happen for some time. Another option would be to refactor scikit-learn's existing implementation of gradient boosting to take a base_estimator like their AdaBoostRegressor does, then pass an Earth model as the base_estimator.
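As a rough sketch of that last option's flavor: scikit-learn's AdaBoostRegressor already accepts an arbitrary base_estimator, and its loss='linear' option is less sensitive to outliers than squared loss. The toy data and settings below are purely illustrative, not a tuned example.

import numpy
from sklearn.ensemble import AdaBoostRegressor
from pyearth import Earth

# Toy data with a few injected outliers (illustrative only)
numpy.random.seed(0)
X = numpy.random.uniform(-3, 3, size=(500, 1))
y = numpy.sin(X[:, 0]) + 0.1 * numpy.random.normal(size=500)
y[::50] += 5.0

# base_estimator can be any scikit-learn regressor, including Earth;
# loss='linear' penalizes large residuals less aggressively than squared loss
model = AdaBoostRegressor(base_estimator=Earth(max_degree=1),
                          n_estimators=10, loss='linear')
model.fit(X, y)
y_hat = model.predict(X)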
The gradient boosting regression idea seems good, although it does seem like it would be a decent amount of work. I had hoped the suggestion I made, whilst not as good, would be easier to implement. Do you have any other suggestions for removing outliers that would overly bias my fit?
Can MARS output the standard deviation at a given point instead of the predicted value? If so, I could fit MARS, find the standard deviation at a given point, and see if any outliers lie outside the range of, say, 3 standard deviations, then fit MARS again to the resulting data.
There are a lot of nice ways to detect outliers. The isolation forest idea here might be good. You could remove the outliers before fitting a MARS model, or just give them less weight.
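A minimal sketch of both options on toy data (the contamination rate and down-weighting factor here are arbitrary guesses you would tune for real data):

import numpy
from sklearn.ensemble import IsolationForest
from pyearth import Earth

numpy.random.seed(1)
m = 500
X = numpy.random.uniform(-40, 40, size=(m, 1))
y = 10 + 0.2 * X[:, 0] + numpy.random.normal(size=m)
y[::40] += 20  # inject some outliers

# Flag likely outliers in joint (X, y) space; contamination is a guess
Z = numpy.column_stack([X, y])
iso = IsolationForest(contamination=0.05, random_state=0)
iso.fit(Z)
inlier = iso.predict(Z) == 1

# Option 1: drop the flagged points before fitting
model = Earth(max_degree=1).fit(X[inlier], y[inlier])

# Option 2: keep them but down-weight (Earth.fit accepts sample_weight)
weights = numpy.where(inlier, 1.0, 0.1)
model = Earth(max_degree=1).fit(X, y, sample_weight=weights)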
I sometimes use MARS in a pipeline with something else. You could put an Earth model in a pipeline with the GradientBoostingRegressor, for example. You could also just use GradientBoostingRegressor instead of py-earth.
Out of curiosity, what is your reason for wanting to use MARS?
Hi
That does make sense. I might have a look at isolation forest and gradient boosted regression. With regards to using MARS: I used to work on N-body simulations of galaxies, but recently I moved into industry, where I am forecasting gas demand. MARS seemed to be able to reduce the chance of over-fitting, allow non-linear responses, and allow cross terms. In my case a cross term with temperature*wind is a must.
There may well be better methods, as I still have a lot to learn and I have been teaching myself, but for me MARS seems quite intuitive. Gradient boosted regression has made me wonder about another issue I have. In my case I may be supplying a generator with gas. Maybe said generator turns on between given dates, but when it is on its use is constant. I could of course define a parameter for when it's on or not and use it as a flag, but I had hoped I could use the day of year as an explanatory variable.
I made a little test to see what MARS would do (code below). In the example the value is 10 on points 0-199 and 500-999, and 20 on points 200-499.
It seems I have two issues:
- at the change there is very odd behavior, seemingly over- and undershooting the range
- as MARS has only one intercept, within the different regions it seems to be 'forced' to create sloped fits to accommodate the increase in y
I believe gradient boosted regression would work very well for this example. However, if I take most of my portfolio, where gas demand looks like a cosine wave due to the changes in temperature, I would not want gradient boosted regression to subdivide the data with respect to the date, but rather have the relationship defined by the temperature dependence fitted by MARS.
Do you have any ideas to solve this?
cheers
import numpy
import matplotlib.pyplot as plt
from pyearth import Earth

# Create some fake data
numpy.random.seed(2)
m = 1000
X = numpy.linspace(-40, 40, m).reshape(m, 1)  # reshape to a column; Earth expects 2-D X
y_true = numpy.zeros(m) + 10
y_true[200:500] = y_true[200:500] + 10
y = y_true + 0.5 * numpy.random.normal(size=m)

# Fit an Earth model
model = Earth(max_degree=1, verbose=True)
model.fit(X, y)

# Print the model
# print(model.trace())
# print(model.summary())

# Plot the model
y_hat = model.predict(X)
plt.figure()
plt.plot(X, y, 'r.')
plt.plot(X, y_hat, 'b.')
plt.plot(X, y_true, 'g.')
plt.show()
P.S. It occurs to me MARS has this issue because:
- there can only be one intercept
- the functions have to be continuous; it would be nice to be able to specify whether a term has to be continuous
Maybe one option is to use novelty detection to create a set of flags, then use these as categorical variables.
Correct. The main difference between MARS and decision trees is that MARS produces continuous models. Things can get a little weird near jumps. If you know that your data has noncontinuous relationships it might be better to use a decision tree based model, which essentially produces piecewise constant functions. As you said, if you have some way of figuring out where the discontinuities are, you can fix this problem using binary categorical inputs.
Below is a reworked version of your example showing one possible idea for a case where you have both continuous and noncontinuous relationships. You'd have to experiment on your own data, of course. The example uses the sklearntools package I linked to before, but you could accomplish the same thing with bare scikit-learn with a little more work. Hopefully it's clear what it's doing and you can run it without too much trouble. To install sklearntools, I use:
pip install git+https://gitlab.com/jcrudy/sklearntools.git
There are some dependencies for sklearntools that you might not expect, such as sympy and autopep8, but hopefully nothing too difficult to set up. Also, this example only works with very recent versions of scikit-learn, probably 0.18.0 or higher. Here is my example:
import numpy
import matplotlib.pyplot as plt
from sklearn.tree.tree import DecisionTreeRegressor
from sklearntools.sklearntools import DelegatingEstimator, standard_methods
from sklearntools.earth import Earth
from sklearntools.calibration import SelectorTransformer
class DecisionPathTransformer(DelegatingEstimator):
    '''
    Just overrides transform to use decision_path instead. Useful for pipelines.
    '''
    def __init__(self, estimator):
        self.estimator = estimator
        self._create_delegates('estimator', standard_methods)

    def transform(self, X, exposure=None):
        args = {'X': X}
        if exposure is not None:
            args['exposure'] = exposure
        result = self.estimator.decision_path(**args).todense()
        if len(result.shape) == 1:
            result = result[:, None]
        return result
numpy.random.seed(2)
m = 1000
X = numpy.linspace(-40, 40, m)
X = X.reshape(m, 1)
y_true = numpy.zeros(m) + 10
y_true[200:500] = y_true[200:500] + 10
y = y_true + numpy.cos(X[:, 0]) + 0.5 * numpy.random.normal(size=m)
# Do a first pass through a decision tree to find the discontinuities,
# then pass indicators for the different leaves of the tree
# into the Earth model, along with the original data.
model = (DecisionPathTransformer(DecisionTreeRegressor(max_leaf_nodes=3))
         & SelectorTransformer(slice(None))) \
        >> Earth(max_degree=2, max_terms=100, smooth=True, verbose=True, thresh=1e-10)
model.fit(X, y)
y_hat = model.predict(X)
plt.figure()
plt.plot(X, y, 'r.')
plt.plot(X, y_hat, 'b.')
plt.plot(X, y_true, 'g.')
plt.show()
Hey
Thanks for this, I think it will prove very useful. I have another question relating to the continuous nature of py-earth and periodic data. I want to derive an estimate for some weather variables, e.g. temperature, wind, and solar radiation, every day over the next few years. I have about 50 years' worth of data.
I used MARS with x data that was the day of the year and the absolute day (the current day minus the first day in the dataset). I wanted to use absolute day as well so I can correct for any global warming etc. This seems to work great; for temperature I get the sine wave I expected, apart from one thing: between the last and first day of the year the simulation jumps. I realize this is because it doesn't know that the 365th day and the 1st day should be similar. Is there a way I can tell MARS to be periodic over this variable?
Cheers.
P.S. Thanks a lot for your help, it's very useful!
I've considered adding support for cyclic predictors, but haven't yet got around to actually doing it. In your case, I'd suggest just changing the input. For example, instead of day of the year, use min(days since last winter solstice, days until next winter solstice). This isn't as good as true support for cyclic variables, however. If you're not sure the winter solstice is a good zero point, or you think there may be some asymmetry in the year, you could also try some other points and functions and let the algorithm determine which one or combination of them is best. For example, you could have some kind of bump function every 30 days or something.
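A minimal sketch of that input change, assuming day of year runs 1-365 and taking day 355 as an approximate winter solstice:

import numpy

day_of_year = numpy.arange(1, 366)  # placeholder input, 1..365
solstice = 355  # approximate winter solstice (an assumption, pick your own zero point)

# min(days since last solstice, days until next solstice)
dist = numpy.abs(day_of_year - solstice)
solstice_distance = numpy.minimum(dist, 365 - dist)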
Hi
The addition of cyclical variables would be great; I do have quite a few of them. Yeah, there does seem to be some asymmetry in the response. An idea I did have was taking the sin and cos of the day of the year, essentially defining a circle. If I pass both of these variables in, I hoped the combination would provide something that behaves continuously. If there is only symmetry, the cos part would be small. I guess I just wondered whether casting the variables in such a way would cause MARS to have difficulty fitting?
Cheers
That seems like a good way to do it to me. I added an issue about cyclic predictors. I'm glad to see I'm not the only one interested. It will likely not happen soon, though, unfortunately.
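For reference, the circular encoding described above could be as simple as the following sketch (a 365-day year is assumed); both columns then go into the model together:

import numpy

day_of_year = numpy.arange(1, 366)  # placeholder input, 1..365
angle = 2 * numpy.pi * day_of_year / 365.0

# Day 365 and day 1 land next to each other on the unit circle
X_cyclic = numpy.column_stack([numpy.sin(angle), numpy.cos(angle)])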
@Fish-Soup I recently added a general gradient boosting estimator to the sklearntools package. I've so far only experimented with using it for quantile regression and binomial classification with MARS, but in theory you should be able to plug in the Huber loss function from sklearn. As always, sklearntools is an experimental and rapidly changing package used mostly only by me, so be cautious. Here is the relevant module: https://github.com/jcrudy/sklearntools/blob/master/sklearntools/gb.py
You can see some usage examples in the tests: https://github.com/jcrudy/sklearntools/blob/master/sklearntools/test/test_gb.py
Cool, I'll look into this.
I've finally had time to look at the example from Jan 19th. It seems to me that the accuracy is very dependent on the tree depth. I'm a bit unsure how to grid search the example you provided. How would I do that?
Cheers
@Fish-Soup You could use sklearntools to do a grid search (you have to manually generate the list of candidates, currently), or just write your own for-loop to do it, or use a scikit-learn pipeline instead of the sklearntools thing and then use scikit-learn grid search. ModelSelectorCV is the sklearntools way to do it, although it isn't nearly as well tested as anything in scikit-learn. I didn't grid search for my example. I decided on a very low depth because essentially I was only looking for discontinuities in single variable responses. I would guess that increasing the max_leaf_nodes very much would start to result in overfitting, which might be what you observe if you grid search it.
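For the for-loop option, a minimal sketch might look like the following, reusing the DecisionPathTransformer class and imports from the earlier example (the candidate leaf counts here are arbitrary):

import numpy
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hold out a validation set and score each candidate leaf count
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
best_mse, best_n = numpy.inf, None
for n_leaves in [2, 3, 4, 6, 8]:
    candidate = (DecisionPathTransformer(DecisionTreeRegressor(max_leaf_nodes=n_leaves))
                 & SelectorTransformer(slice(None))) \
                >> Earth(max_degree=2, max_terms=100, smooth=True, thresh=1e-10)
    candidate.fit(X_train, y_train)
    mse = mean_squared_error(y_val, candidate.predict(X_val))
    if mse < best_mse:
        best_mse, best_n = mse, n_leaves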
Hi. So I manually searched through the leaf number, testing out of sample. What appears to happen is something slightly different from what I imagined. In my test I specify a square wave that changes to a sine wave. It seems that the tree does split the square wave up, but it also splits the sine wave. I imagined the tree and MARS would be more integrated, such that a tree node is specified as a region where MARS fits the data well, but it appears that actually a tree is fit as normal and then each node is sent to a separate MARS model. This means that I seem to need quite a few nodes to get the square wave fitted well, but this splits the sine period up and makes MARS fit over smaller regions, and thus makes it less accurate.
Is there a way to do as I proposed, where each time the data is split to build a leaf node, a MARS model is fit and the error of the MARS model is used to define the tree?
@Fish-Soup Yeah, I see what you're saying. Definitely a limitation of the proposed method. The alternative you're proposing could be accomplished by modifying the MARS forward pass to also fit a piecewise constant function with a single split and check its performance against the best MARS basis function candidate. However, it's not currently implemented and you'd have to get pretty deep into the py-earth code, including the cython code, to make it happen. In general, I think it could be a good idea to add the capability to pass custom fittable functions to the py-earth fitting process, as they'd be useful for many reasons, but I won't be able to do it any time soon.
If you want to attempt to edit py-earth to build what you're suggesting, I'd be happy to give you some technical support on it. It's hard for me to estimate how big a task it would be for you, though. For me, already knowing the code base very well, it would probably be a week of work to do it right.
Modifying the code is something I could look into at some point, but at the moment I don't have time. As you say, if it would take you a week it would take me twice as long at least. I like the idea of the piecewise constant function; that way the regularization should be able to find whether including it is worth it or not. I did find a method of doing this whereby you use a tree or possibly DBSCAN to tag data points with categorical data flags. These can then be passed into MARS, allowing it to jump. I'm not sure this is the best option, as choosing the categorical data flags seems somewhat arbitrary; it would be better if they were linked to the hinge positions in MARS.
User-defined functions also seemed a good idea. I was mainly thinking about them in terms of my time series data, but I'm sure there are other uses. I did see some work on TSMARS, which is supposed to be time series MARS, but I don't know much about it.
thanks again for the great package.
Hi, I went to try to implement https://github.com/jcrudy/sklearntools/blob/master/sklearntools/test/test_gb.py
However, the line from sklearntools.gb import GradientBoostingEstimator errors because it can't find the gb module.
I upgraded with pip install git+https://gitlab.com/jcrudy/sklearntools.git --upgrade, and the final line is: Successfully installed sklearntools-0.1.
From that description it seems I am not getting the most up-to-date version of sklearntools. Do I need to check out a different branch?
@Fish-Soup I'm glad you asked. I moved the whole repo to github a while ago, and the gitlab version is no longer up-to-date. Here is the new url: https://github.com/jcrudy/sklearntools
Has anyone implemented confidence intervals on the splines?
@rsmahabir Not that I know of, but I'm not sure exactly what kind of confidence interval you're talking about. Can you explain in a little more detail what you're looking for?
@jcrudy I was thinking confidence intervals around the different splines, as in this simplistic example: https://tomholderness.wordpress.com/2013/01/10/confidence_intervals/. I saw that the 'earth' package in R does it, but it's so rigid, and for me Python offers a bit more flexibility to interact with and extract data from modules.
You can see an example of what I think you mean in the pygam package. There you can see the confidence intervals with respect to the input or predicted variables.
@jcrudy thanks for the email, there was actually a bug in my code. I can extract the knots from model.forward_pass_record_
@Fish-Soup Awesome, thank you. I like the pygam package, and some of the test figures I've played around with closely mirror the results from MARS after constraining the model. I think the confidence intervals would be a valued addition, though. With the other packages in scikit-learn you simply train the model and generate the confidence intervals (CIs). This becomes more difficult in MARS with the penalty factor, since the individual basis functions would tend to deviate from your typical OLS-type output when this occurs. In my case, I use a penalty value of 22 to generate the general fitting that I need. My initial approach was to use the knots as the end points of the lines, and generate CIs for each segment.
Any ideas how to work with the penalized basis functions to generate confidence intervals?
Thanks again for all the assistance.
@rsmahabir If what you're looking for is a confidence interval for the predicted conditional mean (the output of Earth.predict), it will depend on what kinds of assumptions you're willing to make and how much data you have. I read through your example briefly, and if I understand correctly it seems like there's some kind of normality assumption at play there? I'm not sure I completely understand what they're doing, though. If you could explain it in more detail I might be able to recommend how to implement it, but I'm guessing that if you're willing to make the same assumptions then the method given in your example could be adapted for MARS.
If you want to make fewer parametric assumptions then the bootstrap is an option. However, fitting a MARS model for each bootstrap iteration might take a long time. If you want to do 100 iterations, for example, that's 100 fits. Depending on your data and operational constraints, this may or may not be a problem.
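A minimal sketch of that bootstrap approach for the conditional mean, assuming training arrays X and y and a simple Earth configuration:

import numpy
from pyearth import Earth

n_boot = 100  # 100 refits, as mentioned above
rng = numpy.random.RandomState(0)
preds = numpy.empty((n_boot, X.shape[0]))
for b in range(n_boot):
    # Resample rows with replacement, refit, and predict on the original X
    idx = rng.randint(0, X.shape[0], size=X.shape[0])
    preds[b] = Earth(max_degree=1).fit(X[idx], y[idx]).predict(X)

# Pointwise 95% confidence interval for the predicted conditional mean
lower = numpy.percentile(preds, 2.5, axis=0)
upper = numpy.percentile(preds, 97.5, axis=0)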
@Fish-Soup It looks like pygam uses the distributional assumptions inherent in GAMs, which MARS does not have. However, I wasn't able to find exactly how pygam calculates those intervals. Do you have any insight here?
Possibly relevant is the distinction between confidence intervals and prediction intervals. It looks like pygam does both? If what you're looking for is the latter, my method for doing that is to use gradient boosting to create a sum of MARS models that gives the necessary conditional quantiles. Like bootstrapping (for confidence intervals), gradient boosting has the disadvantage that it involves many model fits and could take a long time. It is in fact even worse because it can't be trivially parallelized like bootstrapping. I'm happy to share my gradient boosting code. It's decent but isn't exactly production quality yet.
@rsmahabir @Fish-Soup See this nice answer by Stephen Milborrow (author of the R earth package): https://stats.stackexchange.com/a/112110/40731
@jcrudy sadly I don't know how the intervals are created in pygam.
A possible suggestion for speeding up bootstrapping or gradient boosting is to create a hybrid py-earth/elastic net model: fit the hyperparameters once for the elastic net, then bootstrap/gradient boost the elastic net component.
This is roughly 100 times faster, but what it won't do is allow the knot positions to change, just the relative slopes of the regions.
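A minimal sketch of that hybrid, assuming training arrays X and y: fit Earth once so the knots are fixed, pick the elastic net hyperparameters once, then bootstrap only the cheap linear fit on the basis matrix.

import numpy
from pyearth import Earth
from sklearn.linear_model import ElasticNet, ElasticNetCV

earth = Earth(max_degree=1).fit(X, y)
B = numpy.asarray(earth.transform(X))  # Earth basis matrix at the training points

# Choose the hyperparameters once, then bootstrap the linear part only;
# the knots stay where the single Earth fit put them
alpha = ElasticNetCV(cv=5).fit(B, y).alpha_
rng = numpy.random.RandomState(0)
preds = []
for b in range(100):
    idx = rng.randint(0, B.shape[0], size=B.shape[0])
    preds.append(ElasticNet(alpha=alpha).fit(B[idx], y[idx]).predict(B))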
@Fish-Soup @jcrudy Sorry for the Houdini act in between, some work obligations had me busy.
@jcrudy it would be cool to see your gradient boosting code in action.
@rsmahabir I've put the boosting code here: https://github.com/modusdatascience/booster.
Unfortunately, there's no documentation yet. To make a 95% prediction interval, I would use it to fit a 2.5th conditional percentile model and a 97.5th conditional percentile model. Here is how you set up the 97.5th percentile model:
from booster.booster import Booster
from booster.loss_functions import SmoothQuantileLossFunction
from pyearth import Earth  # assuming the py-earth estimator here

p = 0.975
loss_function = SmoothQuantileLossFunction(1, p, .0001)
model = Booster(Earth(use_fast=True, max_terms=10, verbose=True),
                loss_function, n_estimators=150, verbose=True)
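Usage would then be a sketch along the following lines, under the assumption that Booster follows the usual scikit-learn fit/predict convention (the linked test file has real examples): fit the model above for the upper bound and a second one at p = 0.025 for the lower bound.

# Assumed scikit-learn-style fit/predict; see the linked tests for real usage
lower_model = Booster(Earth(use_fast=True, max_terms=10, verbose=True),
                      SmoothQuantileLossFunction(1, 0.025, .0001),
                      n_estimators=150, verbose=True)
model.fit(X, y)
lower_model.fit(X, y)
interval = (lower_model.predict(X), model.predict(X))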
Of course, you could put whatever regressor you want in place of the Earth model, and you would likely have to fine-tune. If you use Earth, I recommend keeping max_terms low as above.