
Design of a general API for Regression problems

Open • BigCrunsh opened this issue on Jul 22, 2014 • 37 comments

This issue carries on the discussion at https://github.com/JuliaStats/Roadmap.jl/issues/14 with the intent of making things more concrete. To start, I see a list of apparent questions and TODOs.

As always, your suggestions and opinions are really appreciated:

Interfaces The basic functionality of a regression model should be to 1) optimise the model parameters using a set of instances and labels, and 2) make predictions for a set of instances. Perhaps we should do some renaming, e.g. (a sketch of the resulting interface follows the list below):

  • Model -> RegressionModel; make it abstract
  • classify -> predict?
  • more comprehensive name for RegERMs?

See also https://github.com/JuliaStats/Roadmap.jl/wiki/Proposal:-interface-for-JuliaStats-packages-(Statistics-and-Machine-Learning).
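To make the proposed renaming concrete, here is a minimal sketch of what such an interface could look like (a sketch only; all names are tentative and up for discussion):

abstract RegressionModel

# 1) optimise the model parameters given instances X (one per column) and labels y
optimize(model::RegressionModel, X::AbstractMatrix, y::AbstractVector) =
    error("optimize not implemented for $(typeof(model))")

# 2) make predictions for a set of instances (renamed from classify)
predict(model::RegressionModel, X::AbstractMatrix) =
    error("predict not implemented for $(typeof(model))")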

Instances of RegERMs What should happen with concrete instances like SVM, LogReg, RidgeRegression, Lasso, ...?

  • implemented / instantiated in this package
  • moved to subfolder
  • own packages for each of these instances, each based on this package.

loss.jl / regularizer.jl The aim of this package is to combine these concepts into the regularised empirical risk minimisation framework. For now, it makes sense to keep them in this package and consolidate them with existing approaches like https://github.com/johnmyleswhite/Loss.jl/. A sketch of how they might be composed follows below.
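For illustration, a hedged sketch of how a loss and a regularizer might be composed into one regularised ERM objective (RegERM, objective, and the value methods are hypothetical names; the Loss/Regularizer APIs are as discussed below):

type RegERM{L<:Loss, R<:Regularizer}
    loss::L
    reg::R
end

# total regularised empirical risk for decision values u, labels y, parameters w
objective(m::RegERM, u::AbstractVector, y::AbstractVector, w::AbstractVector) =
    sum([value(m.loss, u[i], y[i]) for i = 1:length(y)]) + value(m.reg, w)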

sgd.jl Stochastic gradient descent methods are typical solvers for RegERMs. I am not sure where it fits best and who else uses it. Should it be moved to (or just replaced by) https://github.com/johnmyleswhite/SGD.jl/?

Kernels

  • Currently, kernels are only implemented via mercer_map.jl. The concrete kernel functions should probably be moved into a different package, since 1) the concept of kernels is quite general and can also be used by, e.g., https://github.com/JuliaStats/HypothesisTests.jl, and 2) there are many related methods, e.g., for approximating kernels (for efficiency), which are too specific for this package but generally useful.
  • mercer_map.jl: computes a finite mapping w.r.t. a given kernel for a fixed training set. It feels like this implementation of kernels belongs at the model level.

Cross-validation Cross-validation is important for estimating the performance of a given model and for hyper-parameter tuning. The regularisation parameter is actually just another model parameter, which has to be optimised. Thus, procedures like marginal likelihood tuning and cross-validation should be located at the solver level. See, e.g., https://github.com/JuliaStats/MLBase.jl.

I would like to make the whole discussion a bit more concrete and apply/adapt these concepts asap to an example like logistic regression (multi-class, kernelized) to see the limits and problems.

BigCrunsh avatar Jul 22 '14 07:07 BigCrunsh

My suggestion: we should have three abstract types (probably more, but these are the basics): Loss, Regularizer, and RegressionProblem.

An instance of Loss should be just a functor, for example:

abstract Loss

type SquaredLoss <: Loss end

value(f::SquaredLoss, x::Real) = 0.5 * (x * x)
deriv(f::SquaredLoss, x::Real) = float64(x)

# sometimes, value and deriv can share part of the computation
value_and_deriv(f::SquaredLoss, x::Real) = (z = float64(x); (0.5 * (z * z), z))

The incorporation of the input data into the loss function should happen during regression. The loss function doesn't have to keep the data and response as fields.
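For example, usage would then look like this (a sketch; the values assume the SquaredLoss definitions above):

l = SquaredLoss()
value(l, 1.5)            # 1.125
deriv(l, 1.5)            # 1.5
value_and_deriv(l, 1.5)  # (1.125, 1.5)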

I will talk about Regularizer and RegressionProblem later.

lindahua avatar Jul 22 '14 11:07 lindahua

The computation of the total loss and the gradient w.r.t. the linear coefficients can be readily implemented based on the value and deriv functions above:

n = size(X, 2)
u = X'theta
s = 0.0
d = Array(Float64, n)
for i = 1:n
    v, d[i] = value_and_deriv(f, u[i])
    s += v
end
grad = X * d

Obviously, one can make this into a function, or adapt it to cases with an intercept.
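For instance, a sketch of the intercept variant (tloss_and_grad and the argument b are hypothetical names; everything else follows the loop above):

# total loss and gradients w.r.t. theta and the intercept b
function tloss_and_grad(f::Loss, X::AbstractMatrix, theta::AbstractVector, b::Real)
    n = size(X, 2)
    u = X'theta .+ b          # decision values including the intercept
    s = 0.0
    d = Array(Float64, n)
    for i = 1:n
        v, d[i] = value_and_deriv(f, u[i])
        s += v
    end
    return (s, X * d, sum(d))  # sum(d) is the gradient w.r.t. b
end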

lindahua avatar Jul 22 '14 12:07 lindahua

Similarly, I think w shouldn't be a field of a regularizer; it should instead be supplied as an argument to the function that evaluates the value and/or gradient. For example:
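A sketch of what such a stateless regularizer could look like (SqrL2Reg and the function names are hypothetical):

abstract Regularizer

type SqrL2Reg <: Regularizer
    lambda::Float64   # penalty weight; the only state the regularizer keeps
end

# w is supplied as an argument, not stored in the type
value(r::SqrL2Reg, w::AbstractVector) = 0.5 * r.lambda * dot(w, w)
grad(r::SqrL2Reg, w::AbstractVector) = r.lambda * w
value_and_grad(r::SqrL2Reg, w::AbstractVector) = (value(r, w), grad(r, w))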

lindahua avatar Jul 22 '14 12:07 lindahua

I agree in principle with these abstract types. For the loss, however, I suggest having a function that takes a decision value and a label, since there are differences between classification and "regression", e.g.,

value(l::SquaredLoss, f::Real, y::Real) = 0.5 * (f-y) * (f-y)
deriv(l::SquaredLoss, f::Real, y::Real) = float64(f-y)

# for -y*f > 34, exp(-y*f) would overflow; there log(1+exp(-y*f)) ≈ -y*f
value(l::LogisticLoss, f::Real, y::Real) = (u = -y*f; u > 34 ? u : log1p(exp(u)))
deriv(l::LogisticLoss, f::Real, y::Real) = -y / (1 + exp(y*f))

Furthermore, I think we need a version that takes a vector of decision values, since the loss for, e.g., multi-class logistic regression depends on all decision values.

BigCrunsh avatar Jul 22 '14 12:07 BigCrunsh

Sure. The loss should take a predictor value and response value. That was my oversight.

lindahua avatar Jul 22 '14 12:07 lindahua

For multi-class logistic, we may have another loss abstract type, maybe called MLoss or MultivalueLoss?

Yes, such loss functors should take vector inputs.

lindahua avatar Jul 22 '14 12:07 lindahua

or just different signatures, value(l::LogisticLoss, f::Real, y::Real) and value(l::LogisticLoss, f::Vector, y::Real)?

BigCrunsh avatar Jul 22 '14 13:07 BigCrunsh

and perhaps value(l::LogisticLoss, f::Vector), which returns the losses for all possible classes.

BigCrunsh avatar Jul 22 '14 13:07 BigCrunsh

I think we should have different kinds of loss functions depending on the forms of the input and output.

The most common type is the one that takes a scalar predictor value and compares it with a scalar response. However, as @BigCrunsh suggested, multi-class logistic regression works a little bit differently.

The actual estimation algorithms would be different for these two types of losses. For example, the solution for the former should be a vector of length d (or d+1 if a bias/intercept is taken into account), while that for the latter should be a matrix of size (d, m) or (d+1, m).

Hence, it would be useful to have multiple abstract types of loss, so that we can dispatch to different estimation procedures depending on the form of loss.

To be more specific, I think the API can be summarized as follows (taking into account @BigCrunsh's comments above):

abstract AbstractLoss

# for scalar -- scalar
abstract Loss <: AbstractLoss

# for multi-class cases
abstract MulticlassLoss <: AbstractLoss

# Specific examples of Loss (e.g. logistic)
type LogisticLoss <: Loss end

# evaluate the value of the loss
value(f::LogisticLoss, x::Real, y::Real) = log1p(exp(-y * x))

# evaluate the derivative w.r.t. x
deriv(f::LogisticLoss, x::Real, y::Real) = (e = exp(-y * x); -y * e / (1 + e))

# fuse the computation of value and derivative;
# this example shows that the fused computation is more efficient
# when we want both the value and the derivative
function value_and_deriv(f::LogisticLoss, x::Real, y::Real)
    e = exp(-y * x)
    return (log1p(e), -y * e / (1 + e))
end

# Example of MulticlassLoss
type MulticlassLogisticLoss <: MulticlassLoss end

function value(f::MulticlassLogisticLoss, x::AbstractVector, y::Integer)
    es = exp(x)
    -log(es[y] / sum(es))   # negative log of the soft-max probability of class y
end

# one can do value(f, x, 1:m) to obtain the losses for all classes jointly
function value(f::MulticlassLogisticLoss, x::AbstractVector, y::UnitRange)
    es = exp(x)
    -log(es[y] ./ sum(es))
end

# likewise, we may also define ``deriv`` and ``value_and_deriv``

lindahua avatar Jul 22 '14 20:07 lindahua

Various generic functions can be written based on the loss functors:

# total loss and its gradient w.r.t. theta
function tloss_and_gradient(f::Loss, 
                    X::AbstractMatrix, 
                    y::AbstractVector, 
                    theta::AbstractVector)
    n = size(X, 2)  # n is the number of samples in X
    u = X'theta    
    s = 0.0
    dv = zeros(n)
    for i = 1:n
        v, dv[i] = value_and_deriv(f, u[i], y[i])
        s += v
    end
    return (s, X * dv)  # X * dv is gradient w.r.t. theta
end

function tloss_and_gradient(f::MulticlassLoss, 
                    X::AbstractMatrix, 
                    y::AbstractVector,  # a vector of integer labels 
                    theta::AbstractMatrix)  # note: theta should be a matrix, size (d, m)
    n = size(X, 2)
    m = size(theta, 2)  # m is the number of classes
    U = theta'X  # size (m, n)
    D = zeros(m, n)
    s = 0.0
    for i = 1:n
        v, dv = value_and_deriv(f, U[:,i], y[i])
        s += v
        D[:,i] = dv
    end
    return (s, A_mul_Bt(X, D))
end

This also demonstrates why we have to introduce Loss and MulticlassLoss.

tloss_and_gradient is not a great name; I just use it as an example.

lindahua avatar Jul 22 '14 20:07 lindahua

Some modifications to the details may make the implementation more efficient.

First, we probably want a value_and_deriv! for MulticlassLoss that writes the derivatives directly into a pre-allocated array:

function value_and_deriv!(dv::AbstractVector, f::MulticlassLoss, x::AbstractVector, y::Integer)
    # ......
end

# generic for all subtypes of MulticlassLoss;
# returns the value together with the freshly allocated derivative vector
function value_and_deriv(f::MulticlassLoss, x::AbstractVector, y::Integer)
    dv = Array(Float64, length(x))
    v = value_and_deriv!(dv, f, x, y)
    return (v, dv)
end

With this, part of the tloss_and_gradient can be made more efficient:

    dv = zeros(m) 
    for i = 1:n
        v = value_and_deriv!(dv, f, view(U,:,i), y[i])
        s += v
        D[:,i] = dv
    end

This avoids repeatedly allocating new vectors within the inner loop (here view would be a non-copying column view, e.g. from ArrayViews.jl, or Base's sub).

lindahua avatar Jul 22 '14 20:07 lindahua

We can then introduce an abstract type RegressionSolver to solve a regression analysis problem:

abstract RegressionSolver

type MySolver <: RegressionSolver
    ....
end

# any solver type should implement a `solve` method as below
function solve(solver::MySolver, f::Loss, r::Regularizer, X::AbstractMatrix, y::AbstractVector)
    ....
end

# a solver may (optionally) support other types of losses, for example:
function solve(solver::MySolver, f::MulticlassLoss, r::Regularizer, X::AbstractMatrix, y::AbstractVector)
    ....
end
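For illustration, a hedged sketch of one concrete solver: plain gradient descent built on the tloss_and_gradient from above and the hypothetical value_and_grad for the regularizer (the step-size handling is deliberately naive):

type GDSolver <: RegressionSolver
    stepsize::Float64
    maxiter::Int
end

function solve(solver::GDSolver, f::Loss, r::Regularizer,
               X::AbstractMatrix, y::AbstractVector)
    theta = zeros(size(X, 1))
    for it = 1:solver.maxiter
        v, g = tloss_and_gradient(f, X, y, theta)   # data term
        rv, rg = value_and_grad(r, theta)           # regularizer term
        theta -= solver.stepsize * (g + rg)
    end
    return theta
end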

lindahua avatar Jul 22 '14 20:07 lindahua

I hope the several posts above clarify my thoughts about the API design.

lindahua avatar Jul 22 '14 20:07 lindahua

Wouldn't the gradient require knowledge of some link function (maybe determined by RegressionProblem) as well? In tloss_and_gradient you're calculating the gradient as X * dv, but in GLMs for example, isn't that only the case if you're using the canonical link?

lendle avatar Jul 22 '14 21:07 lendle

At this level, we are only considering the loss function and the regularization function.

The GLM related stuff like link function and response distribution is at a higher level.

For GLMs, the loss function is -log p(y|x), which is not the link function itself.

lindahua avatar Jul 22 '14 21:07 lindahua

After reading again, it looks like the argument f in value(::Loss, f, y) is expected to be linear in the coefficients. Is that correct? In that case, the gradient calculation makes sense.

If so, does this mean that nonlinear regression and multilayer neural nets should not fit in this framework?

lendle avatar Jul 22 '14 21:07 lendle

The current design is for such an optimization problem:

\sum_i f(\theta^T x_i; y_i) + r(\theta)

Here, the first argument to f is the dot product between the parameter vector theta and the input x_i.

It is, however, possible to extend this framework to handle the following generalized cases

\sum_i f(\eta(\theta)^T x_i; y_i) + r(\theta)

by introducing an additional nonlinear transform from \theta to \eta. @lendle: I guess this is what you mean by nonlinear regression?

lindahua avatar Jul 22 '14 21:07 lindahua

@lindahua By nonlinear regression I meant solving the optimization problem

\sum_i f(h(\theta, x_i); y_i) + r(\theta)

for some known, nonlinear h as in nonlinear least squares.

lendle avatar Jul 22 '14 21:07 lendle

I see. It is not difficult to generalize to such cases. I will follow up later.

lindahua avatar Jul 22 '14 22:07 lindahua

I suppose the easiest way would be to have something like

abstract AbstractLoss
abstract Loss <: AbstractLoss
abstract LinLoss <: Loss #probably need a better name
type LogisticLoss <: LinLoss end
type SquaredLoss <: LinLoss end

and define a default tloss_and_gradient! for LinLoss based on value_and_deriv defined for subtypes of LinLoss. Other subtypes of Loss that are not subtypes of LinLoss will have to define their own tloss_and_gradient! method in addition to value_and_deriv.

Another option for names, since it's probably most common for the h function above to be h(theta, x) = theta'x:

abstract AbstractLoss
abstract ScalarLoss <: AbstractLoss 
abstract Loss <: ScalarLoss
abstract NonLinLoss <: ScalarLoss
type LogisticLoss <: Loss end
type SquaredLoss <: Loss end
...

Then loss functions where h is nonlinear can extend NonLinLoss.

lendle avatar Jul 22 '14 22:07 lendle

It might be better for tloss_and_gradient to calculate the mean loss instead of the sum, so that the loss is O(1) instead of O(n) and the amount of regularization does not depend on the sample size.

lendle avatar Jul 22 '14 22:07 lendle

Generally, the relative amount of regularization does depend on the sample size.

This can be easily seen from a probabilistic perspective. Consider a MAP problem:

p(\theta) \prod_i p(y_i|x_i; \theta)

Taking the logarithm turns this into a sum:

\log p(\theta) + \sum_i \log p(y_i | x_i; \theta)

I don't think it is correct to take the mean in the second term.

This makes a lot of sense: when you have only a small set of observations, you rely more on the prior; as more samples become available, the prior plays a relatively smaller role.

lindahua avatar Jul 22 '14 23:07 lindahua

It's true that you'd want the relative amount of regularization to decrease as the total sample size increases, and typically you'd probably want to calculate the loss/gradient on the full data set.

In my mind, I think of the optimization problem as the mean loss + regularization, and I'd decrease the relative regularization by decreasing the penalty parameter as the sample size increases. I'm probably in the minority there, and most of the time it doesn't really matter anyway.

A use case where it does matter is SGD/mini-batch GD, where you calculate the loss/gradient on a small fraction of the full data set, which you can think of as an estimate of the loss/gradient on the full data set. It's a lot simpler to think of loss + regularization as essentially the same whether you're calculating the loss on 1 observation or n. Hopefully I've articulated that clearly; let me know if not.

@lindahua What are your thoughts on that?

Edit: Maybe I'm not as much in the minority as I previously thought. R's glmnet, scikit-learn's ElasticNet, and statsmodels' OLS.fit_regularized all minimize the mean loss + regularization.

lendle avatar Jul 23 '14 00:07 lendle

The behavior as to whether to use the mean or sum can be specified by a keyword argument.
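For example (a sketch; avg is a hypothetical keyword name):

function tloss_and_gradient(f::Loss, X::AbstractMatrix, y::AbstractVector,
                            theta::AbstractVector; avg::Bool=false)
    n = size(X, 2)
    u = X'theta
    s = 0.0
    dv = zeros(n)
    for i = 1:n
        v, dv[i] = value_and_deriv(f, u[i], y[i])
        s += v
    end
    g = X * dv
    return avg ? (s / n, g / n) : (s, g)   # mean or sum of per-sample losses
end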

lindahua avatar Jul 23 '14 01:07 lindahua

Oh! You guys are very quick. I am trying to order my thoughts:

Loss types

The actual estimation algorithms would be different for these two types of losses. For example, the solution for the former should be a vector of length d (or d+1 if a bias/intercept is taken into account), while that for the latter should be a matrix of size (d, m) or (d+1, m).

Hence, it would be useful to have multiple abstract types of loss, so that we can dispatch to different estimation procedures depending on the form of loss.

The former is just a special case of the latter for m=1, isn't it? I see no difference between the estimation procedures, and I would like to keep things as simple as possible. Could we start with just one common abstract loss type and make the distinction only if it becomes necessary?

Total loss / Non-linear models / Link functions As @lendle mentioned, @lindahua's version of tloss_and_gradient computes the loss over all examples for a linear model. I see no need to restrict ourselves to linear models, and I would like to treat this in a bit more general way. The general objective, which we aim to minimise, is

L[\theta] = \sum_i \ell(f_\theta(x_i), y_i) + \Omega[f_\theta],

for some model parameter vector \theta, loss function \ell, and regularizer \Omega. This formulation also covers linear models f_\theta(x) = f_w(x) = w^T x. Although non-linear models can be reduced to linear models via mercer_map.jl (as in the current implementation), the formulation above also includes the native dual model f_\theta(x) = f_\alpha(x) = \sum_i \alpha_i k(x_i, x).

The general gradient is then

L'[\theta] = \sum_i f'_\theta(x_i)\ell'(f_\theta(x_i), y_i) + f'_\theta\Omega'[f_\theta],

where the gradients f', \ell', ... are taken w.r.t. \theta.

So, instead of having a vector theta, I suggest having a type model here, which has to provide a value and a gradient w.r.t. the model parameters; a sketch follows below. Then the loss itself is independent of how the decision values are computed. This was the intent of model in the current implementation.
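A sketch of that model abstraction (Model, LinearModel, and grad are hypothetical names; a linear model serves as one instance):

abstract Model

type LinearModel <: Model
    w::Vector{Float64}
end

# decision value f_theta(x) and its gradient w.r.t. the parameters w
value(m::LinearModel, x::AbstractVector) = dot(m.w, x)
grad(m::LinearModel, x::AbstractVector) = x    # d/dw (w'x) = x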

I am not absolutely sure, but isn't a link function just reflected by a specific type of loss function? If so, it is already captured, right @lendle?

SGD/mini-batch I agree with @lindahua that we should keep the analogy of RegERMs to MAP estimation. And yes, @lendle, we also need the possibility to calculate stochastic gradients of the objective for SGD/mini-batches. But this isn't a problem (see, e.g., RegERMs.jl (line 23)). We should also have a function like stochastic_loss_and_gradient that computes part of the objective as:

\sum_{i \in S} \ell(f_\theta(x_i), y_i) + \frac{1}{|S|}\Omega[f_\theta],

where S is a subset of all examples.

Log loss A bit unrelated, but for numerical stability we should later "approximate" the log loss log(1+exp(-y*f)) by -y*f whenever -y*f > 34, and for the multi-class version (soft-max) subtract the maximum over all decision values before exponentiating (es = exp(x - maximum(x)); this is equivalent but stable).
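A sketch of the stabilised soft-max loss (assuming the MulticlassLogisticLoss value signature from above; shifting by the maximum leaves the loss unchanged but keeps exp from overflowing):

function value(f::MulticlassLogisticLoss, x::AbstractVector, y::Integer)
    z = x .- maximum(x)   # invariant shift for numerical stability
    es = exp(z)
    -log(es[y] / sum(es))
end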

BigCrunsh avatar Jul 23 '14 07:07 BigCrunsh

What do you guys think of a bottom-up approach, where we just start to try this on an example in a WIP branch and see how it feels? I added a first attempt here https://github.com/JuliaStats/RegERMs.jl/pull/4 and tried some of the concepts. Feel free to comment (it is neither complete nor efficient yet).

BigCrunsh avatar Jul 23 '14 08:07 BigCrunsh

@BigCrunsh Thanks for starting to work on it. Would you please also try to implement multi-class logistic regression? I think that would show how the optimization procedures for multi-class and standard logistic regression differ from each other.

See my comments; there are a few efficiency suggestions. But the main issue I see is that you are currently using symbols to indicate different solvers, which would make the framework difficult to extend. It would be better to introduce RegressionSolver and a set of subtypes to indicate different solvers:

abstract RegressionSolver

type SGDSolver <: RegressionSolver 
    # some fields to specify how the learning rates are defined
end

function optimize(solver::SGDSolver, X::AbstractMatrix, y::AbstractVector)
    ....
end

type BFGSSolver <: RegressionSolver end

function optimize(solver::BFGSSolver, X::AbstractMatrix, y::AbstractVector)
    ....
end

Anyone can then implement their own solver without hacking around if-else statements.

In terms of generalizing to nonlinear regression, I think @lendle's formulation uses the following as the risk term:

loss(h(x, \theta), y)

I agree that we should support this.

Generally, the risk computation consists of two steps:

  1. produce a prediction using h(x, \theta). In linear cases, h(x, \theta) = \theta'x.
  2. measure the deviation between the prediction and response, using the loss function.

These two steps are kind of orthogonal to each other. Using GLM's terminology, h may be called a Predictor.

It is not difficult to generalize the framework by incorporating a predictor into the optimize function. A predictor should be able to produce a prediction value and a derivative w.r.t. \theta. When the predictor is omitted, we fall back to the standard linear case (a sketch follows the signatures below).

function optimize(solver, predictor, X, y)

function optimize(solver, X, y)   # falls back to linear cases
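A hedged sketch of such a predictor (Predictor, LinearPredictor, and the function names are hypothetical):

abstract Predictor

type LinearPredictor <: Predictor end

# prediction h(x, theta) and its derivative w.r.t. theta
predict(p::LinearPredictor, theta::AbstractVector, x::AbstractVector) = dot(theta, x)
deriv(p::LinearPredictor, theta::AbstractVector, x::AbstractVector) = x

# the omitted-predictor case falls back to the linear predictor
optimize(solver, X, y) = optimize(solver, LinearPredictor(), X, y)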

lindahua avatar Jul 23 '14 12:07 lindahua

@lindahua: yes, I agree. As soon as I have time, I will start with the multi-class logreg. Currently, I just want to see how the conceptual pieces work out. Feel free to join in (and especially feel free to fix the efficiency issues).

BigCrunsh avatar Jul 23 '14 15:07 BigCrunsh

Regarding RegressionSolver, I also agree in principle. However, I would like to build on Optim.jl whenever possible in order to avoid duplicating work.

BigCrunsh avatar Jul 23 '14 15:07 BigCrunsh

I agree with focusing first on getting things working. I can always help optimize the performance once your PR is merged.

We can have a BFGSSolver that simply delegates to the BFGS algorithm in Optim.jl, and we can have an SGDSolver as well.

This way, we don't repeat the work already done in Optim.jl; we just use it.

The purpose of the different solver subtypes is primarily dispatch. The internal implementation can be delegated to other packages' functions or implemented by ourselves.

lindahua avatar Jul 23 '14 15:07 lindahua