SymbolicRegression.jl
SymbolicRegression.jl copied to clipboard
Variable subset selection for linear regression
I noticed that it took quite some time for the symbolic regression to beat a simple variable subset selection using linear regression. To clarify what I mean with this, I consider a large regressor matrix A
in the problem y = A*b
where b
are the parameters and the task is to select a subset of the columns of A
to include in the linear model. This can typically be solved to near optimality with LASSO regression. After running for long enough, the symbolic regression did indeed find a better equation for my toy problem at a similar number of estimated parameters that did my subset selection, but it makes me wonder if subset selection can be used as a heuristic approach to seed some of the population members?
For reference, I include a naive, brute force algorithm that selects n
variables from a regressor matrix A
in an exact way, in case my above explanation didn't make sense
using IterTools, TotalLeastSquares
function ls_selection(A,y, n)
m = size(A, 2)
errors = Float64[]
indices = Vector{Int}[]
for ni = 1:n
for inds = IterTools.subsets(1:m, ni)
At = A[:,inds]
θ = At \ y
E = y - At*θ
e = sum(abs, E)
push!(errors, e)
push!(indices, inds)
end
end
mi = indices[argmin(errors)]
mi, errors, indices
end
This sort of comparison is completely expected. Any algorithm that searches on a small subspace of equation space is going to beat a genetic algorithm which searches the complete equation space. e.g., for typical numbers, the linear equation space would have 10^10 fewer combinatorial possibilities than the full equation space.
I like the idea that allowing the user to put in priors on the equation, such as a high frequency of polynomial terms, would be really useful for these sorts of common equations! Maybe a polynomial search could be part of the internal loop? Although this is very specific to these sorts of equations, whereas SymbolicRegression.jl can work with arbitrary (even non-differentiable) operators.
also see https://github.com/MilesCranmer/PySR/issues/15
(The core algorithm improved a lot over the past year - it should do fine at polynomial searches now, so long as the true equation is a relatively small polynomial)