Roadmap.jl
Unify the efforts for Regression/GLM
Regression (e.g. linear regression, logistic regression, Poisson regression, etc.) is very important in machine learning. Many problems can be formulated as (regularized) regression.
Regression is closely related to generalized linear models. A major portion of regression problems can be considered as estimation of generalized linear models (GLMs). In other words, estimating a GLM can be cast as a regression problem where the loss function is the negative log-likelihood.
There have been a few Julia packages in this domain. Just to name a few, we already have:

- GLM: https://github.com/JuliaStats/GLM.jl
- GLMNet: https://github.com/simonster/GLMNet.jl
- Regression: https://github.com/lindahua/Regression.jl
- NLreg: https://github.com/dmbates/NLreg.jl
- RegERMs: https://github.com/BigCrunsh/RegERMs.jl
- SVM: https://github.com/JuliaStats/SVM.jl
- LIBSVM: https://github.com/simonster/LIBSVM.jl
- SGD: https://github.com/johnmyleswhite/SGD.jl
- Loss: https://github.com/johnmyleswhite/Loss.jl
- LARS: https://github.com/simonster/LARS.jl

and probably some others that I am missing.
The functionality provided by these packages overlaps substantially, yet they do not work with each other.
Unifying these efforts into a coherent framework for regression/GLM would definitely make Julia a much more appealing option for machine learning. I am opening this thread to initiate the discussion.
Below is a proposal for how we may proceed:

- Front-end and back-end should be decoupled. To me, a regression framework consists of four basic aspects:

  - Data: observed features/attributes and responses.
  - Model: an entity that can be used to make predictions given new observations. A model should contain the coefficients for producing linear predictors (when data are given) and information about how to link predictors to responses.
  - Problem: given data, estimating model parameters can be cast as an optimization problem with a certain objective function (and optionally some constraints).
  - Algorithm: the procedure for solving a given problem.

  The front-end modules should provide functions to help users turn their data and domain-specific knowledge into optimization problems, while the back-end should focus on solving the given problems. These two parts require different skills: the former is mainly concerned with user API design, while the latter is mainly about efficient optimization algorithms. (A type-level sketch of this decoupling follows the package list below.)

- I propose the following way to reorganize the packages:

  - `RegressionBase.jl`: provides types to represent regression problems and models. This package should also provide other facilities for expressing regression problems, e.g. loss functions, regularizers, etc. It can also provide some classical/basic algorithms for solving a regression problem. (This may more or less adopt what `RegERMs` is doing.)
  - `GLMNet.jl` (depends on `RegressionBase.jl`): wraps the external glmnet library to provide efficient solvers for certain regression problems. The part that depends on `DataFrames` should be separated out.
  - Similarly, `SGD.jl`, `LARS.jl`, etc. should also depend on `RegressionBase.jl` and provide different kinds of solvers. Note that `GLMNet`, `SGD`, and `LARS` should accept the same problem types and have a consistent interface; they just implement different algorithms.
  - `Regression.jl`: a meta-package that includes `RegressionBase.jl` and a curated set of solver packages (e.g. `GLMNet`, `SGD`, etc.).
  - `GLMBase.jl` (depends on `Distributions.jl` and `Regression.jl`): provides types to represent generalized linear models and relevant machinery such as link functions. This package can take advantage of `Regression.jl` for model estimation.
  - `GLM.jl` (depends on `GLMBase.jl` and `DataFrames.jl`): provides a high-level UI for end users to perform data analysis. (The user interface can remain the same as it is.)
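To make the Data/Model/Problem/Algorithm split concrete, here is a minimal type-level sketch in present-day Julia syntax. All names are invented for illustration (none of the packages above define them), and the solver is deliberately the most naive one:

```julia
using LinearAlgebra

abstract type RegressionProblem end          # what the front-end produces

# A concrete problem: data plus a description of the objective, no algorithm.
struct RidgeProblem <: RegressionProblem
    X::Matrix{Float64}    # observed features, one row per sample
    y::Vector{Float64}    # observed responses
    lambda::Float64       # regularization strength
end

abstract type Solver end                     # each back-end package adds subtypes
struct CholeskySolver <: Solver end

# The model: everything needed to predict on new observations.
struct LinearModelFit
    coef::Vector{Float64}
end
predict(m::LinearModelFit, Xnew::Matrix{Float64}) = Xnew * m.coef

# Every solver package implements the same entry point: problem -> model.
function solve(p::RidgeProblem, ::CholeskySolver)
    coef = (p.X'p.X + p.lambda * I) \ (p.X'p.y)   # ridge normal equations
    return LinearModelFit(coef)
end
```

The point is that `solve(::RegressionProblem, ::Solver)` is the only contract between front-end and back-end; GLMNet, SGD, and LARS would each just add a `Solver` subtype and a corresponding `solve` method.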
Your suggestions and opinions are really appreciated.
The first question that we need to answer is whether we should introduce `RegressionBase.jl` (which would borrow part of the stuff from `RegERMs`). If there's no objection, I can set up this package and then we can discuss interface designs from there.
We can then proceed with the adjustment of the other packages.
cc: @dmbates @johnmyleswhite @simonster @BigCrunsh @scidom @StefanKarpinski
I like the design and the choice of abstractions!
I very much like this idea.
Alright, now I see that it is an issue in the repository. I might suggest putting MixedModels within this framework too. Linear mixed models, generalized linear mixed models, and nonlinear mixed models are all in the regression model family.
Great initiative! I agree with this abstraction. But your argument holds in general for all regularized empirical risk minimization approaches. Is it necessary to restrict the base package `RegressionBase.jl` to regression and exclude classification methods like SVMs and logistic regression? So I would slightly extend / modify your suggestion:

- Instead of `RegressionBase.jl`, I would propose a more general `RegERMs.jl`-like base package: provide types to represent regularized empirical risk problems and models. This package should also provide loss functions, regularizers, kernels, etc.
- `Classification.jl` in addition to `Regression.jl`: both could be umbrellas for empirical risk instances (SVMs, LogReg, MatrixFactorization, RidgeRegression, ...) as well as other prediction approaches. Minor concern: I am not sure yet how to handle methods that can be used for both, like decision trees.
I am totally fine with using `RegERMs.jl` for that purpose and seeing how we can adapt it and the affected methods. This wiki might also be helpful for the interfaces.
cc: @gusl
@BigCrunsh In my mind, the term regression can be understood in a quite general sense -- to put it intuitively, optimizing a sum of losses over given data plus a regularization term in some form.
If you don't mind, we can just use `RegERMs` as the basis and go from there. If you agree with this idea, what about moving the package `RegERMs` to `JuliaStats`?
Sounds good.
@BigCrunsh: I have added you as one of the owners of JuliaStats, so you have the privilege to move packages here.
Generally, I support the idea of more standardized APIs and unification of our many regression packages. Some more specific comments below.
It would be great if our API supported fitting multiple dependent variables in some way, either explicitly or by offering a `fit!` method that reuses the factorization of the design matrix when applicable; see JuliaStats/StatsBase.jl#83.
L1 solvers are often used to fit many models spanning the entire regularization path because 1) fitting the entire regularization path is often not much more computationally expensive than fitting a single model (esp. for LARS, which has to fit the preceding part of the regularization path anyway) and 2) the regularization parameters are typically selected by cross validation, so knowledge of the entire regularization path is useful. We should thus have a standardized API for holding the regularization path and performing cross validation. Perhaps we should support the same for ridge, although the standard Cholesky algorithm doesn't benefit as much from fitting the entire regularization path and generalized cross validation is often used in place of ordinary cross validation.
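To make the path-fitting argument concrete, here is a self-contained sketch for the ridge case (all names invented, not any package's API): the Gram matrix `X'X` is formed once and reused for every λ on the grid, and k-fold cross-validation picks the regularization strength from the same path.

```julia
using LinearAlgebra

# Fit the whole regularization path for ridge: each λ is just one linear
# solve against a precomputed X'X and X'y.
function ridge_path(X::Matrix, y::Vector, lambdas::Vector)
    XtX, Xty = X'X, X'y
    # one column of coefficients per λ -- the "path"
    return reduce(hcat, [(XtX + λ*I) \ Xty for λ in lambdas])
end

# Pick λ by k-fold cross-validation over the same grid.
function cv_ridge(X, y, lambdas; k = 10)
    n = size(X, 1)
    folds = [i:k:n for i in 1:k]             # simple interleaved folds
    err = zeros(length(lambdas))
    for f in folds
        train = setdiff(1:n, f)
        B = ridge_path(X[train, :], y[train], lambdas)
        err .+= vec(sum(abs2, X[f, :] * B .- y[f]; dims = 1))
    end
    return lambdas[argmin(err)]              # λ with lowest held-out error
end

# e.g. cv_ridge(randn(200, 5), randn(200), exp10.(range(1, -4; length = 20)))
```

For the lasso the same shape holds, except `ridge_path` would be replaced by a coordinate-descent or LARS path solver.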
As far as a high-level interface for fitting models goes, as of JuliaStats/DataFrames.jl#571, you can fit any model that defines `StatsBase.fit(Type{T<:RegressionModel}, X, y)` as `StatsBase.fit(Type, y ~ x1 + ... + xn, df)`, and it will wrap the resulting model so that `coeftable` prints the proper labels for the predictors; other methods applicable to `RegressionModel` are passed through to the underlying model object. There is still some work to be done here: it should be possible to call `predict` on a DataFrame, and I will investigate wrapping functions defined only for a specific model object. (Right now a `DataFrameRegressionModel` supports only the methods defined on `RegressionModel` in StatsBase.) I'm also not entirely sure what the low-level API should look like for MixedModels. But in general, I think this is a good way to split the high-level API out from the code that fits models and avoid making the low-level packages depend on DataFrames.
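For concreteness, a small example of the two entry points just described. This uses the present-day GLM.jl incarnation of the API (the `@formula` macro and exact signatures postdate this thread), so treat it as illustrative rather than the 2014 interface:

```julia
using DataFrames, GLM

df = DataFrame(x1 = randn(100), x2 = randn(100))
df.y = 2 .* df.x1 .- df.x2 .+ randn(100)

# High-level: formula + DataFrame. The wrapper makes coeftable print
# the predictor names and passes RegressionModel methods through.
m1 = fit(LinearModel, @formula(y ~ x1 + x2), df)
coeftable(m1)

# Low-level: the matrix/vector method a solver package actually defines.
X = hcat(ones(100), df.x1, df.x2)
m2 = fit(LinearModel, X, df.y)
coef(m2)
```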
I have nothing to add except that this is my favorite issue in a long time. (Besides the "Can" issue.)
I'm hoping to receive comments / edits on the wiki, and that this document will evolve into the standard interface doc for Statistics / ML models.
Sorry for the late reply; I am on holiday for the next couple of weeks, thus the delay. This is a great initiative, and I favor the proposed abstraction and unification for regression models. More generally, I favor the unification of model specification across packages, as discussed in PGM. I view the intended codebase for regression models as a first step in this collaborative direction (if I am not mistaken, regression can be expressed as a factor graph?).
@gusl thanks for creating the wiki.
I am not completely sure that there can be a common interface that works for all statistical models. Generative Bayesian networks, discriminative models, Markov random fields, time series, stochastic processes -- most of these can be called statistical models, and I can't imagine one interface that fits them all. For example, a Bayesian network may involve multiple variables, not just x and y, that are related to each other in a complicated way, while a time series model needs to be updated over time.
I think it is more pragmatic to consider interface designs for individual families of models. Within this restricted context, many of your proposals do make a lot of sense.
This issue, in particular, focuses on a common family of problems -- regression analysis. Generally, regression analysis aims to find relations between dependent variables (also known as responses) and independent variables (e.g. features/attributes). A typical classification problem can be considered a special case of the regression problem that tries to find relations between the features and the class labels.
From a mathematical standpoint, a regression problem can be formalized in two ways:

- (Regularized) empirical risk minimization. This is an optimization problem that usually involves two parts, namely loss terms and a regularization term (transcribed into code below this list):

      minimize \sum_i w_i f(x_i, y_i; \theta) + r(\theta)

  Here, `f` is the loss function and `r` is the regularizer.

- Conditional distribution. This formulates the relation between `x` and `y` as a conditional distribution `p(y | x)`. One can also impose a prior over the parameter `\theta` as `p(\theta)`. MAP estimation of `\theta` can then be cast as a risk minimization problem as above, where the loss is `f(x, y) = -log p(y | x)` and the regularizer is `r(\theta) = -log p(\theta)`.
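A direct transcription of the first formulation into standalone Julia (not any package's API), using squared loss and an L2 regularizer as the concrete instances of `f` and `r`:

```julia
using LinearAlgebra

sqloss(x, yi, θ) = abs2(dot(θ, x) - yi) / 2     # f(x, y; θ), squared loss
l2reg(θ, λ)      = λ * sum(abs2, θ) / 2         # r(θ), ridge regularizer

# Σ_i w_i f(x_i, y_i; θ) + r(θ), with samples stored as columns of X
function objective(X, y, w, θ, λ)
    risk = sum(w[i] * sqloss(X[:, i], y[i], θ) for i in eachindex(y))
    return risk + l2reg(θ, λ)
end
```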
The generalized linear model is a special case of the regression analysis problem outlined above, where the dependent variable `y` is connected to the independent variables `x` in a special form that involves a (possibly nonlinear) link function and a distribution over responses. This kind of formulation, while restricted in form, is incredibly flexible: many important regression problems, notably linear regression, logistic regression, and Poisson regression, belong to this family.
A generalized linear model can be estimated in two ways: (1) cast it to a regularized risk minimization problem; or (2) use algorithms dedicated to GLMs.
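To connect route (1) back to the ERM formulation above: for logistic regression with labels y ∈ {-1, +1}, the negative log-likelihood is itself the loss, so the GLM estimate falls out of the generic minimizer. A standalone one-liner, not a package API:

```julia
using LinearAlgebra

# -log p(y | x) for the logit-link GLM; plug this into the generic
# objective above to recover (regularized) logistic regression.
neglogistic(x, yi, θ) = log1p(exp(-yi * dot(θ, x)))
```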
Conceptually, all these things can be divided into three levels:

- Solver level: this level concerns loss functions and regularization. The representation of loss functions and regularizers, as well as basic algorithms, can be mostly implemented in a base package (preferably `RegERMs`). Advanced or specialized algorithms may be implemented in other packages with a standardized interface, such as `GLMNet`, `LARS`, `SVM`, `SGD`, etc.
- Model level: this level concerns the probabilistic formulation, e.g. evaluating conditional probabilities and likelihoods and computing useful statistics. This level should go into `GLMBase` and `MixedModels`. Note that this level makes it possible to incorporate regression models into a bigger probabilistic framework, e.g. hierarchical mixtures of experts.
- Semantics level: this level concerns assigning semantic interpretations to the results. At this level, each variable may be associated with a semantic meaning (e.g. temperature, speed, duration). The mathematical machinery at the lower levels can be combined with a semantic context (e.g. `DataFrames`) to achieve this goal. Major tasks at this level: (1) turn the given user inputs & data into a lower-level form that the math algorithms can operate on (see the sketch after this list); (2) invoke a proper solver/algorithm; (3) deliver the results to the user in a meaningful way.
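As a concrete illustration of task (1), the sketch below strips the semantic context from a `DataFrame` and hands the solver level plain numbers. `ModelFrame`/`ModelMatrix` are the formula machinery that lived in DataFrames.jl at the time (now in StatsModels.jl, whose `@formula` macro is used here), so the exact calls are illustrative:

```julia
using DataFrames, StatsModels

df = DataFrame(temperature = randn(50), speed = randn(50), duration = randn(50))
mf = ModelFrame(@formula(duration ~ temperature + speed), df)
X  = ModelMatrix(mf).m    # numeric design matrix for the solver level
y  = df.duration          # numeric response vector
```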
A major principle in modern software engineering is separation of concerns. This principle also applies here. I can imagine that different groups of developers (of course these groups may overlap in reality) may focus on different levels:
- Solver level: people interested in optimization or machine learning algorithms.
- Model level: people interested in statistics or probabilistic modeling.
- Semantic level: people interested in data mining; end users.
Particularly, people who implement solvers or machine learning algorithms should not be concerned about things like data frames etc. It is the responsibility of the higher level packages to convert data frames into a problem in standardized forms (that only involve numerical matrices and vectors).
I hope this further clarifies my thoughts.
to @scidom: the model level of this formulation (as outlined above) can be seen as a factor in a probabilistic graphical model, and thus can be incorporated in a larger probabilistic framework.
My experience developing Distributions and Graphs is that interfaces may change a lot relative to what was originally planned. It would be useful to start building the package and make changes as necessary as we move forward. We can update the wiki as the API matures.
As to how we may proceed, I think the next step would be to start working on the regression codebase (starting from the solver level).
@BigCrunsh: would you please move `RegERMs` over to `JuliaStats` when you are ready? We can work together on the detailed design of the API over there.
@lindahua is right about getting started. Look to JuliaOpt for evidence that this can work, although it involved a smaller set of developers. We have a solver level (i.e. a package for each solver wrapper), a generic interface level to all solvers that defines a canonical form (MathProgBase.jl), and currently one modeling interface (JuMP.jl, although CVX.jl will join it soon).
I looked at the code in `RegERMs`, and believe it is going in the right direction. The package already provides some basic infrastructure, such as `Loss`, `Regularizer`, and some types to represent regression problems.
We probably need to enrich that system through more discussion. However, I think it is already a good starting point.
This breakdown into the solver, model, and semantics levels is very good. It might be a bit premature, but I find that making the names of things line up with the concepts can be very helpful for getting everyone on the same page conceptually. (This is why I'm so picky about naming.) Perhaps there should be packages named `RegressionSolvers`, `RegressionModels`, and `RegressionInterfaces`? Perhaps not all of the code goes in there, but it seems like there will need to be common base types that can live there.
@StefanKarpinski These names would be useful as abstract types. This whole effort involves close interaction between these types, hence it would make sense to put the type hierarchy in a foundational package, together with a clear document on how they work. Other packages can extend them or build on top of them.
Originally, I proposed having a package named `RegressionBase`. However, after looking at @BigCrunsh's `RegERMs`, I think that would be the right place to host these.
On Mon, Jul 21, 2014 at 6:49 AM, Dahua Lin [email protected] wrote:

> @gusl thanks for creating the wiki.
>
> I am not completely sure that there can be a common interface that works for all statistical models. [...] I can't imagine one interface that can fit them all.

Thanks for bringing up separation of concerns.

> For example, a Bayesian network may involve multiple variables, not just x and y, that are related to each other in a complicated way;

The issue with graphical models is that 'fit' can mean many different things:

- StatisticalModel: e.g. a specific graphical model, such as an A -- B -- C Ising model with a free parameter for each edge
- InferenceGoal: e.g. MLE, MAP estimate, posterior approximation by Monte Carlo, etc.
- InferenceAlgorithm (Solver): e.g. optimization with an interior-point method, Metropolis-Hastings, etc.

(I'm introducing an extra level, between Model and Solver.)

My idea is that `fit` should still be used, with extra arguments that have default values. E.g., given an instance of the A--B--C Ising model with a free parameter for each edge, `fit` would assume by default that you want a MAP estimate with a diffuse prior; but you can also specify that you want a posterior approximation, in which case it assumes you want Metropolis-Hastings and will use a standard proposal distribution, while also allowing you to pass your own proposal. (A sketch of this idea follows at the end of this comment.)

> while a time series model needs to be updated over time.

I would say that `fit_more!` applies in this case.

> I think it is more pragmatic to consider interface designs for individual families of models. [...] A typical classification problem can be considered a special case of the regression problem that tries to find relations between the features and the class labels.

I agree.

> From a mathematical standpoint, a regression problem can be formalized in two ways: (regularized) empirical risk minimization, minimize \sum_i w_i f(x_i, y_i; \theta) + r(\theta), where f is the loss function and r is the regularizer; [...]

Please pardon my ignorance, but my understanding is that since "risk" means expected loss, "empirical risk minimization" means minimizing risk on unseen data (often estimated by using held-out data)... so it sounds broader than the formulation above. (I guess I'm not convinced that a penalty on \theta provides a universal solution to the problem.)

> [...] and conditional distribution, where MAP estimation of the parameter \theta can be cast as a risk minimization problem as above, with loss f(x, y) = -log p(y | x) and regularizer r(\theta) = -log p(\theta).

This sounds like a special case of the above, namely where the loss function is the negative log-likelihood. If you're doing MLE, r(\theta) is the zero function.

> The generalized linear model is a special case of the regression analysis problem as outlined above, where the dependent variable y is connected to the independent variables x in a special form that involves a (possibly nonlinear) link function and a distribution over responses. [...]

Yes, GLMs are super important.

g is said to be the (inverse) link function if E[Y_i] = g(X_i beta). If g :: Real -> Real, then the GLM is called a "single-index model", because we are summarizing X_i with a single real number (X_i beta). If g :: Real^2 -> Real, we have a "double-index model", which will be a richer model if g is a truly 2D function. Anyway, GLMs get really cool when we don't specify the functional form of g, i.e. semi-parametric models.

> A generalized linear model can be estimated in two ways: (1) cast it to a regularized risk minimization problem; or (2) use algorithms dedicated to GLMs.

The "algorithms dedicated to GLMs" hopefully optimize the same objective function as (1).

> Conceptually, all these things can be divided into three levels: solver level, model level, and semantics level. [quoted in full above]

I completely agree with the above, including the point that people who implement solvers should not be concerned with things like data frames.
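A minimal sketch of the `fit`-with-defaults idea above; every type and function name here is invented for illustration, and the `_fit` bodies are stubs rather than real inference code:

```julia
# InferenceGoal sits between Model and Solver: the same model can be
# fit toward different goals, with MAP as the cheap default.
abstract type InferenceGoal end
struct MAPEstimate <: InferenceGoal end
struct MonteCarloPosterior <: InferenceGoal
    sampler::Symbol        # e.g. :metropolis_hastings
    proposal               # user-supplied proposal, or `nothing` for default
end
MonteCarloPosterior(; sampler = :metropolis_hastings, proposal = nothing) =
    MonteCarloPosterior(sampler, proposal)

# `fit` keeps one entry point; the goal argument selects the machinery.
fit(model, data; goal::InferenceGoal = MAPEstimate()) = _fit(model, data, goal)
_fit(model, data, ::MAPEstimate)          = "MAP stub"    # placeholder
_fit(model, data, g::MonteCarloPosterior) = "MCMC stub"   # placeholder

# usage: fit(m, d)                                        -> MAP by default
#        fit(m, d; goal = MonteCarloPosterior(proposal = myproposal))
```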
@lindahua: I already moved `RegERMs.jl` to `JuliaStats` :wink:
@gusl you touched on various matters in your last message. As far as passing a user-defined proposal to the Metropolis-Hastings sampler is concerned, I thought about it last month and have a clear idea of how to do it. In fact, I have pretty much completed coding it; once finished, I will merge this generalisation into the `MCMC` package. I will do this soon after I return from holidays.
P.S. In fact, the structure of MCMC will undergo several already-planned changes and refactorings, though this goes beyond the scope of the present thread.
I like the idea of solver, model, and semantics levels. I agree with @lindahua and @IainNZ: let's get started, perhaps by revising the interfaces in `RegERMs.jl`; I started a more detailed discussion over there: https://github.com/JuliaStats/RegERMs.jl/issues/3.
Just one thing, which is probably too early, but sooner or later there will be a large zoo of solvers, and at some point it might be useful to have some benchmarking to derive default choices depending on the number of examples, dimensions, sparsity, ...
Thanks @BigCrunsh.
Let's keep the high level discussions (those that affect the reorganization of packages) here. Detailed API design of regression problems should go to https://github.com/JuliaStats/RegERMs.jl/issues/3, as @BigCrunsh suggested.
A general ensemble package would be great to have under the `RegressionBase.jl` umbrella. A consistent interface makes developing such a package very easy.
@svs14 has done a lot of work on the Orchestra.jl package, which provides heterogeneous ensemble methods and has its own API. I don't know the details, but it might be a good starting place if the API can be made consistent with `RegressionBase.jl`.
I have OnlineLearning.jl, which fits GLMs (linear, logistic, and quantile regression for now), optionally with L1 and/or L2 regularization, via SGD. Standard SGD and some variants (AdaGrad, AdaDelta, and averaged SGD) are implemented. I also started on linear SVMs, but the implementation is not done. (A generic sketch of the SGD update these solvers share follows below.)
I'll keep an eye on JuliaStats/RegERMs.jl#3 and can update the API when that's more fleshed out.
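For readers unfamiliar with the method, here is the plain SGD update such solvers implement -- θ ← θ - η∇f(x_i, y_i; θ), one randomly ordered observation at a time. This is a standalone illustration, not OnlineLearning.jl's actual API:

```julia
using LinearAlgebra, Random

# In-place SGD over samples stored as columns of X.
function sgd!(θ, X, y, grad; η = 0.01, epochs = 5)
    for _ in 1:epochs, i in randperm(length(y))
        θ .-= η .* grad(view(X, :, i), y[i], θ)   # single-sample gradient step
    end
    return θ
end

# e.g. the squared-loss gradient for linear regression:
sq_grad(x, yi, θ) = (dot(θ, x) - yi) .* x

# usage: θ = sgd!(zeros(5), randn(5, 1000), randn(1000), sq_grad)
```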
@lendle: Feel free to do that in that framework :wink:
I'd be happy to get a clean version of the newly proposed L0 EM algorithm into the proper format once the regularized regression design/interfaces have been set. For a spike on the L0 EM algorithm, see:
https://github.com/robertfeldt/FeldtLib.jl/blob/master/spikes/L0EM_regularized_regression.jl
cc: @lindahua
What happened to this project? Are there any new developments? The idea is really great.
Check this out https://github.com/Evizero/SupervisedLearning.jl