EvoTrees.jl
EvoTrees consumes too much memory and crashes for sparse matrices
Let us consider a real ML problem (I cannot show the feature names, but each row below is a separate feature) with 600k observations:
As we can see, there are many categorical features, with roughly 1500 categories per feature on average.
Let us approximate the real features with random one-hot encoded features represented as sparse matrices, and try to apply both XGBoost and EvoTrees:
As we can see, XGBoost handles very sparse data well, while EvoTrees allocates too much memory and crashes with an OOM error.
The notebook is attached: julia_evotrees.zip
The code to reproduce, as a plain script:
# -*- coding: utf-8 -*-
# ---
# jupyter:
# jupytext:
# formats: ipynb,jl:light
# text_representation:
# extension: .jl
# format_name: light
# format_version: '1.5'
# jupytext_version: 1.11.1
# kernelspec:
# display_name: Julia 1.6.0-rc3
# language: julia
# name: julia-1.6
# ---
# + tags=[]
using Pkg; Pkg.activate(".");
# + tags=[]
Pkg.add(["Statistics", "StatsBase", "Revise", "XGBoost", "EvoTrees", "BenchmarkTools", "CUDA", "SparseArrays"]);
# + tags=[]
using Statistics
using StatsBase
using XGBoost
using Revise
using EvoTrees
using BenchmarkTools
using CUDA
using SparseArrays
# + tags=[]
nrounds = 200;
nthread = Threads.nthreads();
# + tags=[]
# xgboost params
params_xgb = ["max_depth" => 5,
              "eta" => 0.05,
              "objective" => "reg:squarederror",
              "print_every_n" => 5,
              "subsample" => 0.5,
              "colsample_bytree" => 0.5,
              "tree_method" => "hist",
              "max_bin" => 64]
metrics = ["rmse"]
# + tags=[]
# EvoTrees params
params_evo = EvoTreeRegressor(T=Float32,
                              loss=:linear, metric=:mse,
                              nrounds=nrounds, α=0.5,
                              λ=0.0, γ=0.0, η=0.05,
                              max_depth=6, min_weight=1.0,
                              rowsample=0.5, colsample=0.5, nbins=64)
# + tags=[]
# Select n distinct indices uniformly at random from 0:K-1, returned in increasing
# order (sequential sampling, Knuth's Algorithm S).
function random_select(n::Int64, K::Int64)
    @assert 0 <= n <= K
    sample = Vector{Int64}(undef, n)
    t = Int64(0)   # candidates examined so far
    m = Int64(0)   # candidates selected so far
    while m < n
        if (K - t) * rand() >= n - m
            t += 1              # skip candidate t
        else
            m += 1
            sample[m] = t       # keep candidate t
            t += 1
        end
    end
    sample
end
# + tags=[]
# Build an N×M sparse matrix with n ones placed at uniformly random positions.
function create_sparseMatrix(n::Int64, N::Int64, M::Int64)
    @assert (0 <= N) && (0 <= M)
    @assert 0 <= n <= N * M
    nonZero = random_select(n, N * M)
    # column-major linear index: k = i + j*N
    I = map(k -> mod(k, N), nonZero)
    J = map(k -> div(k, N), nonZero)
    sparse(I .+ 1, J .+ 1, ones(n), N, M)
end
# -
# Same idea using StatsBase.sample: an N×M sparse matrix with K random ones.
sparseones(N, M, K) = sparse(
    (x -> (first.(x) .+ 1, last.(x) .+ 1))(divrem.(sample(0:N*M-1, K, replace=false), M))...,
    ones(K), N, M
)
# + tags=[]
nobs = Int(600000)
num_feat = Int(50);
n_cats_per_feature = Int(1500);
@info "testing with: $nobs observations | $num_feat features."
# + tags=[]
X = sparseones(nobs, num_feat*n_cats_per_feature, Int64(nobs*num_feat));
# + tags=[]
Y = rand(size(X, 1));
# + tags=[]
@info "xgboost train:"
@time m_xgb = xgboost(X, nrounds, label=Y, param=params_xgb, metrics=metrics, nthread=nthread, silent=1);
# @btime xgboost($X, $nrounds, label=$Y, param=$params_xgb, metrics=$metrics, silent=1);
# + tags=[]
@info "xgboost predict:"
@time pred_xgb = XGBoost.predict(m_xgb, X);
# @btime XGBoost.predict($m_xgb, $X);
# + tags=[]
@info "evotrees train CPU:"
params_evo.device = "cpu"
@time m_evo = fit_evotree(params_evo, X, Y);
# @btime fit_evotree($params_evo, $X, $Y);
# -
@info "evotrees predict CPU:"
@time pred_evo = EvoTrees.predict(m_evo, X);
#@btime EvoTrees.predict($m_evo, $X);
# + tags=[]
CUDA.allowscalar(false)
@info "evotrees train GPU:"
params_evo.device = "gpu"
@time m_evo_gpu = fit_evotree(params_evo, X, Y);
#@btime fit_evotree($params_evo, $X, $Y);
# + tags=[]
@info "evotrees predict GPU:"
@time pred_evo = EvoTrees.predict(m_evo_gpu, X);
#@btime EvoTrees.predict($m_evo_gpu, $X);
# -
EvoTrees is effectively designed around dense matrices, so the poor performance on sparse matrices is unfortunately expected. The library was initially built around my use cases, which didn't involve sparse features.
For clarification, my understanding is that for your use case the data is sparse as a result of one-hot encoding categorical features with a large number of modalities, as opposed to being intrinsically sparse from a large number of different features that are sparsely populated. Is that correct?
Although the latter case appears trickier to adapt to, I think support for categorical features would be a reasonable feature to add. It would involve passing such features as Integers (so a compact "one-cold" representation as opposed to "one-hot"). A treatment similar to what catboost/lightgbm does would then be applied.
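Roughly, the contrast between the two representations looks like this for a single feature with 1500 modalities (an illustrative sketch only; none of these names are an EvoTrees API):

```julia
using SparseArrays

nobs, nlevels = 600_000, 1_500
cats = rand(1:nlevels, nobs)              # raw category index per observation

# "one-hot": a 600_000 × 1_500 sparse block, one such block per categorical feature
onehot = sparse(collect(1:nobs), cats, ones(Float32, nobs), nobs, nlevels)

# "one-cold": a single integer column that a histogram-based tree can bin directly
onecold = Int32.(cats)
```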
This won't be solved by today though :) Suggestions/comments on the proper approach are welcomed!
A treatment similar to what catboost/lightgbm does would then be applied.
Can you point out a document explaining how these algorithms handle "categorical" features?
Sure, here are two docs about it for CatBoost: https://arxiv.org/pdf/1706.09516.pdf and http://learningsys.org/nips17/assets/papers/paper_11.pdf
For LightGBM: https://lightgbm.readthedocs.io/en/latest/Features.html#optimal-split-for-categorical-features, though I don't have access to the referenced 1958 paper :)
There are a couple of ways around it. One can be to simply treat the category index as a numeric feature. This obviously results in an arbitrary grouping of categories, but can nonetheless be valuable. Otherwise, the main idea is to derive groupings from target-related metrics, for example a mapping of the index into a numeric feature based on the credibility-adjusted observed target (so as to limit the overfitting otherwise associated with a straight average of the target). CatBoost has a nice explainer on their methodology here: https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html
So the lazy workaround of using the category encoding as a numeric input can already be applied right now. As for fancier handling of the groupings, the credibility / regularized target values appear to me as the lowest-hanging fruit (similar to the CatBoost explainer), though I'm unclear whether some alternative approach has been proven worth the implementation effort.
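To make that concrete, here is a rough sketch of such a credibility-adjusted mapping using plain additive smoothing toward the global mean (illustrative only, not the exact CatBoost ordered scheme; function and parameter names are made up):

```julia
using Statistics

# Each category's mean target is shrunk toward the global mean, with the prior
# strength `a` controlling how much small categories get regularized.
function smoothed_target_encode(cat::AbstractVector, y::AbstractVector{<:Real}; a::Real = 10)
    prior = mean(y)
    sums = Dict{eltype(cat),Float64}()
    cnts = Dict{eltype(cat),Int}()
    for (c, yi) in zip(cat, y)
        sums[c] = get(sums, c, 0.0) + yi
        cnts[c] = get(cnts, c, 0) + 1
    end
    enc = Dict(c => (sums[c] + a * prior) / (cnts[c] + a) for c in keys(sums))
    return [enc[c] for c in cat]
end
```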
For clarification, my understanding is that for your use case the data is sparse as a result of one-hot encoding categorical features with a large number of modalities, as opposed to being intrinsically sparse from a large number of different features that are sparsely populated. Is that correct?
Yes, that is correct.
This won't be solved by today though :) Suggestions/comments on the proper approach are welcomed!
I bet! As for the comments on the proper approach - I would implement the CatBoost encoder from https://contrib.scikit-learn.org/category_encoders/catboost.html in Julia. See also https://catboost.ai/docs/concepts/educational-materials-papers.html
It would involve passing such features as Integers (so a compact "one-cold" representation as opposed to "one-hot")
I totally agree with this approach - either integers or CategoricalArrays; the rest of the transformations should happen inside the GBM library itself, not like in XGBoost.
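For example, with CategoricalArrays the caller side could stay as simple as this (just a sketch of the intended usage; a GBM accepting the codes or levels directly is exactly the feature being requested, not something that works today):

```julia
using CategoricalArrays

raw   = rand(["cat_$(i)" for i in 1:1500], 600_000)  # one raw categorical feature
col   = categorical(raw)                             # CategoricalArray with 1500 levels
codes = levelcode.(col)                              # compact integer codes in 1:1500

# `col` (or `codes`) is what would be handed to the GBM library, which would then
# perform any target-based grouping of the categories internally.
```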
More articles on the subject:
https://catboost.ai/docs/concepts/educational-materials-papers.html https://towardsdatascience.com/categorical-features-parameters-in-catboost-4ebd1326bee5
We could use https://contrib.scikit-learn.org/category_encoders/_modules/category_encoders/cat_boost.html#CatBoostEncoder as a reference implementation.
This package https://contrib.scikit-learn.org/category_encoders/index.html contains other useful categorical feature encoders that are currently missing in MLJ. For now I ended up calling those encoders from Julia via PyCall.
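For reference, the PyCall route looks roughly like this (a sketch assuming the Python category_encoders package is installed in the environment PyCall is linked to; the data here is a random stand-in):

```julia
using PyCall

ce  = pyimport("category_encoders")
enc = ce.CatBoostEncoder()

X = ["cat_$(rand(1:1500))" for _ in 1:10_000, _ in 1:5]  # stand-in categorical features
y = rand(10_000)                                         # stand-in target

# PyCall converts the Julia arrays to numpy; the encoder returns a pandas DataFrame
# of numerically encoded features.
Xt = enc.fit_transform(X, y)
```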
Thanks @pgagarinov. Cross referenced here: https://github.com/alan-turing-institute/MLJModels.jl/issues/375
Given the improvements to memory footprint throughout the various releases since this issue was opened, and the added support for categorical data through the new Tables API since v0.16, I'll close this issue. Don't hesitate to reopen if the mentioned concerns are still a problem.