EvoTrees.jl
EvoTrees consumes too much memory and crashes for sparse matrices
Let us consider a real ML problem (I cannot show the feature names, but each row below is a separate feature) with 600k observations:
As we can see, there are many categorical features, with roughly 1500 categories per feature on average.
Let us approximate the real features with random one-hot encoded features represented as sparse matrices, and try to apply both XGBoost and EvoTrees:
As we can see, XGBoost handles very sparse data well, while EvoTrees allocates too much memory and crashes with an OOM error.
The notebook is attached: julia_evotrees.zip
The code to reproduce, as a plain script:
# -*- coding: utf-8 -*-
# ---
# jupyter:
# jupytext:
# formats: ipynb,jl:light
# text_representation:
# extension: .jl
# format_name: light
# format_version: '1.5'
# jupytext_version: 1.11.1
# kernelspec:
# display_name: Julia 1.6.0-rc3
# language: julia
# name: julia-1.6
# ---
# + tags=[]
using Pkg; Pkg.activate(".");
# + tags=[]
Pkg.add(["Statistics", "StatsBase", "Revise", "XGBoost", "EvoTrees", "BenchmarkTools", "CUDA", "SparseArrays"]);
# + tags=[]
using Statistics
using StatsBase
using XGBoost
using Revise
using EvoTrees
using BenchmarkTools
using CUDA
using SparseArrays
# + tags=[]
nrounds = 200;
nthread = Threads.nthreads();
# + tags=[]
# xgboost params
params_xgb = ["max_depth" => 5,
              "eta" => 0.05,
              "objective" => "reg:squarederror",
              "print_every_n" => 5,
              "subsample" => 0.5,
              "colsample_bytree" => 0.5,
              "tree_method" => "hist",
              "max_bin" => 64]
metrics = ["rmse"]
# + tags=[]
# EvoTrees params
params_evo = EvoTreeRegressor(T=Float32,
                              loss=:linear, metric=:mse,
                              nrounds=nrounds, α=0.5,
                              λ=0.0, γ=0.0, η=0.05,
                              max_depth=6, min_weight=1.0,
                              rowsample=0.5, colsample=0.5, nbins=64)
# + tags=[]
# Select n distinct indices uniformly at random from 0:K-1, returned in increasing
# order (sequential sampling, Knuth's Algorithm S).
function random_select(n::Int64, K::Int64)
    @assert 0 <= n <= K
    sample = Vector{Int64}(undef, n)
    t = Int64(0)   # candidates examined so far
    m = Int64(0)   # candidates selected so far
    while m < n
        if (K - t) * rand() >= n - m
            t += 1              # skip candidate t
        else
            m += 1
            sample[m] = t       # keep candidate t
            t += 1
        end
    end
    sample
end
# + tags=[]
# Build an N×M sparse matrix with n ones placed at uniformly random positions.
function create_sparseMatrix(n::Int64, N::Int64, M::Int64)
    @assert (0 <= N) && (0 <= M)
    @assert 0 <= n <= N * M
    nonZero = random_select(n, N * M)
    # column-major linear index: k = i + j*N
    I = map(k -> mod(k, N), nonZero)
    J = map(k -> div(k, N), nonZero)
    sparse(I .+ 1, J .+ 1, ones(n), N, M)
end
# -
# Same idea using StatsBase.sample: an N×M sparse matrix with K random ones.
sparseones(N, M, K) = sparse(
    (x -> (first.(x) .+ 1, last.(x) .+ 1))(divrem.(sample(0:N*M-1, K, replace=false), M))...,
    ones(K), N, M
)
# + tags=[]
nobs = Int(600000)
num_feat = Int(50);
n_cats_per_feature = Int(1500);
@info "testing with: $nobs observations | $num_feat features."
# + tags=[]
X = sparseones(nobs, num_feat*n_cats_per_feature, Int64(nobs*num_feat));
# + tags=[]
Y = rand(size(X, 1));
# + tags=[]
@info "xgboost train:"
@time m_xgb = xgboost(X, nrounds, label=Y, param=params_xgb, metrics=metrics, nthread=nthread, silent=1);
# @btime xgboost($X, $nrounds, label=$Y, param=$params_xgb, metrics=$metrics, silent=1);
# + tags=[]
@info "xgboost predict:"
@time pred_xgb = XGBoost.predict(m_xgb, X);
# @btime XGBoost.predict($m_xgb, $X);
# + tags=[]
@info "evotrees train CPU:"
params_evo.device = "cpu"
@time m_evo = fit_evotree(params_evo, X, Y);
# @btime fit_evotree($params_evo, $X, $Y);
# -
@info "evotrees predict CPU:"
@time pred_evo = EvoTrees.predict(m_evo, X);
#@btime EvoTrees.predict($m_evo, $X);
# + tags=[]
CUDA.allowscalar(false)
@info "evotrees train GPU:"
params_evo.device = "gpu"
@time m_evo_gpu = fit_evotree(params_evo, X, Y);
#@btime fit_evotree($params_evo, $X, $Y);
# + tags=[]
@info "evotrees predict GPU:"
@time pred_evo = EvoTrees.predict(m_evo_gpu, X);
#@btime EvoTrees.predict($m_evo_gpu, $X);
# -
EvoTrees is effectively designed around dense matrices, so the poor performance on sparse matrices is unfortunately expected. The library was initially built around my use cases, which didn't involve sparse features.
For clarification, my understanding is that for your use case the data is sparse as a result of one-hot encoding categorical features with a large number of modalities, as opposed to being intrinsically sparse from a large number of different features that are sparsely populated. Is that correct?
Although the latter case appears trickier to adapt to, I think support for categorical features would be a reasonable feature to add. It would involve passing such features as Integers (so a compact "one-cold" representation as opposed to "one-hot"). A treatment similar to what catboost/lightgbm does would then be applied.
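Roughly, the contrast between the two representations looks like this for a single feature with 1500 modalities (an illustrative sketch only; none of these names are an EvoTrees API):

```julia
using SparseArrays

nobs, nlevels = 600_000, 1_500
cats = rand(1:nlevels, nobs)              # raw category index per observation

# "one-hot": a 600_000 × 1_500 sparse block, one such block per categorical feature
onehot = sparse(collect(1:nobs), cats, ones(Float32, nobs), nobs, nlevels)

# "one-cold": a single integer column that a histogram-based tree can bin directly
onecold = Int32.(cats)
```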
This won't be solved by today though :) Suggestions/comments on the proper approach are welcomed!
A treatment similar to what catboost/lightgbm does would then be applied.
Can you point out a document explaining how these algorithms handle "categorical" features?
Sure, here are two docs about it for CatBoost: https://arxiv.org/pdf/1706.09516.pdf and http://learningsys.org/nips17/assets/papers/paper_11.pdf
For LightGBM: https://lightgbm.readthedocs.io/en/latest/Features.html#optimal-split-for-categorical-features, though I don't have access to the referenced 1958 paper :)
There are a couple of ways around it. One can be to simply treat the category index as a numeric feature. This obviously results in an arbitrary grouping of categories, but can nonetheless be valuable. Otherwise, the main idea is to derive groupings from target-related metrics, for example a mapping of the index into a numeric feature based on the credibility-adjusted observed target (so as to limit the overfitting otherwise associated with a straight average of the target). CatBoost has a nice explainer on their methodology here: https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html
So the lazy workaround of using the category encoding as a numeric input can already be applied right now. As for fancier handling of the groupings, the credibility / regularized target values appear to me as the lowest-hanging fruit (similar to the CatBoost explainer), though I'm unclear whether some alternative approach has been proven worth the implementation effort.
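To make that concrete, here is a rough sketch of such a credibility-adjusted mapping using plain additive smoothing toward the global mean (illustrative only, not the exact CatBoost ordered scheme; function and parameter names are made up):

```julia
using Statistics

# Each category's mean target is shrunk toward the global mean, with the prior
# strength `a` controlling how much small categories get regularized.
function smoothed_target_encode(cat::AbstractVector, y::AbstractVector{<:Real}; a::Real = 10)
    prior = mean(y)
    sums = Dict{eltype(cat),Float64}()
    cnts = Dict{eltype(cat),Int}()
    for (c, yi) in zip(cat, y)
        sums[c] = get(sums, c, 0.0) + yi
        cnts[c] = get(cnts, c, 0) + 1
    end
    enc = Dict(c => (sums[c] + a * prior) / (cnts[c] + a) for c in keys(sums))
    return [enc[c] for c in cat]
end
```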
For clarification, my understanding is that for your use case the data is sparse as a result of one-hot encoding categorical features with a large number of modalities, as opposed to being intrinsically sparse from a large number of different features that are sparsely populated. Is that correct?
Yes, that is correct.
This won't be solved by today though :) Suggestions/comments on the proper approach are welcomed!
I bet! As for the comments on the proper approach - I would implement the CatBoost encoder from https://contrib.scikit-learn.org/category_encoders/catboost.html in Julia. See also https://catboost.ai/docs/concepts/educational-materials-papers.html
It would involve passing such features as Integers (so a compact "one-cold" representation as opposed to "one-hot")
I totally agree with this approach - either integers or CategoricalArrays; the rest of the transformations should happen inside the GBM library itself, not like in XGBoost.
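For example, with CategoricalArrays the caller side could stay as simple as this (just a sketch of the intended usage; a GBM accepting the codes or levels directly is exactly the feature being requested, not something that works today):

```julia
using CategoricalArrays

raw   = rand(["cat_$(i)" for i in 1:1500], 600_000)  # one raw categorical feature
col   = categorical(raw)                             # CategoricalArray with 1500 levels
codes = levelcode.(col)                              # compact integer codes in 1:1500

# `col` (or `codes`) is what would be handed to the GBM library, which would then
# perform any target-based grouping of the categories internally.
```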
More articles on the subject:
https://catboost.ai/docs/concepts/educational-materials-papers.html https://towardsdatascience.com/categorical-features-parameters-in-catboost-4ebd1326bee5
We could use https://contrib.scikit-learn.org/category_encoders/_modules/category_encoders/cat_boost.html#CatBoostEncoder as a reference implementation.
This package https://contrib.scikit-learn.org/category_encoders/index.html contains other useful categorical feature encoders that are currently missing in MLJ. For now I ended up calling those encoders from Julia via PyCall.
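For reference, the PyCall route looks roughly like this (a sketch assuming the Python category_encoders package is installed in the environment PyCall is linked to; the data here is a random stand-in):

```julia
using PyCall

ce  = pyimport("category_encoders")
enc = ce.CatBoostEncoder()

X = ["cat_$(rand(1:1500))" for _ in 1:10_000, _ in 1:5]  # stand-in categorical features
y = rand(10_000)                                         # stand-in target

# PyCall converts the Julia arrays to numpy; the encoder returns a pandas DataFrame
# of numerically encoded features.
Xt = enc.fit_transform(X, y)
```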
Thanks @pgagarinov. Cross referenced here: https://github.com/alan-turing-institute/MLJModels.jl/issues/375
Given the improvements to memory footprint throughout the various releases since this issue was opened, and the added support for categorical data through the new Tables API since v0.16, I'll close this issue. Don't hesitate to reopen if the mentioned concerns are still a problem.