
Add sparsity and agtboost matrix class to hold pointer to C++ design matrix and response vector

Open Blunde1 opened this issue 3 years ago • 1 comments

This should be possible with the Eigen sparse matrix class and the R Matrix package, and possibly Rcpp modules to return a pointer to the C++ model object.
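As a rough illustration of the idea above, the following sketch copies an R sparse matrix and a response vector into a C++ object once and hands R an external pointer to it. It assumes the Rcpp, RcppEigen and Matrix packages plus a working C++ toolchain; the names `Data`, `make_data` and `data_nnz` are hypothetical and are not part of agtboost, and an `Rcpp::XPtr` is used here instead of Rcpp modules to keep the example short.

```r
library(Rcpp)
library(Matrix)

# Hypothetical sketch: Data, make_data() and data_nnz() are illustration
# names only, not agtboost functions.
sourceCpp(code = '
// [[Rcpp::depends(RcppEigen)]]
#include <RcppEigen.h>

// Holds the design matrix and response on the C++ side.
struct Data {
  Eigen::SparseMatrix<double> X;
  Eigen::VectorXd y;
};

// [[Rcpp::export]]
SEXP make_data(const Eigen::MappedSparseMatrix<double> X,
               const Eigen::Map<Eigen::VectorXd> y) {
  Data* d = new Data();
  d->X = X;  // copy the dgCMatrix contents into an Eigen sparse matrix
  d->y = y;  // copy the response vector
  // The external pointer deletes the object when R garbage-collects it.
  return Rcpp::XPtr<Data>(d, true);
}

// [[Rcpp::export]]
int data_nnz(SEXP ptr) {
  Rcpp::XPtr<Data> d(ptr);
  return d->X.nonZeros();
}
')

# Small dgCMatrix with three non-zero entries
X <- sparseMatrix(i = c(1, 2, 3), j = c(1, 2, 1), x = c(1, 1, 1), dims = c(3, 2))
y <- c(0, 1, 0)

ptr <- make_data(X, y)  # external pointer to the C++ object
data_nnz(ptr)           # number of stored non-zeros on the C++ side
```

Later calls (training, prediction) could then take `ptr` rather than the R matrix, so the data is converted only once.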

Blunde1 • Aug 18 '20 09:08

Using sparse matrices (for one-hot encoded categorical variables) can lower RAM usage and speed up training, especially on larger datasets. For example, with xgboost:

train size 100K: time (sec): sparse: 17.3, dense: 18.3

train size 1M: time (sec): sparse: 38.0, dense: 86.5; RAM usage: sparse: ~1GB, dense: ~23GB


library(data.table)
library(ROCR)
library(xgboost)
library(Matrix)

set.seed(123)

d_train <- fread("https://github.com/szilard/benchm-ml--data/raw/master/train-0.1m.csv")
#d_train <- fread("https://github.com/szilard/benchm-ml--data/raw/master/train-1m.csv")
d_test <- fread("https://github.com/szilard/benchm-ml--data/raw/master/test.csv")

# Dense design matrix; comment this line out and uncomment the next one for the sparse run
X_train_test <- model.matrix(dep_delayed_15min ~ .-1, data = rbind(d_train, d_test))
#X_train_test <- sparse.model.matrix(dep_delayed_15min ~ .-1, data = rbind(d_train, d_test))

n1 <- nrow(d_train)
n2 <- nrow(d_test)
X_train <- X_train_test[1:n1,]
X_test <- X_train_test[(n1+1):(n1+n2),]
y_train <- ifelse(d_train$dep_delayed_15min=='Y',1,0)

dxgb_train <- xgb.DMatrix(data = X_train, label = y_train)

system.time({
  md <- xgb.train(data = dxgb_train,
                  objective = "binary:logistic",
                  nrounds = 100, max_depth = 10, eta = 0.1,
                  tree_method = "hist")
})

phat <- predict(md, newdata = X_test)
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")
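
The RAM difference reported above can be seen on a much smaller scale without downloading any data. This is only a sketch on synthetic data: the data frame `d`, its `carrier` column and the sizes chosen are made up for illustration and are not taken from the benchmark.

```r
library(Matrix)

set.seed(1)
n <- 10000

# Synthetic data (assumption): one categorical column with up to 200 levels
d <- data.frame(
  y = rbinom(n, 1, 0.5),
  carrier = factor(sample(sprintf("c%03d", 1:200), n, replace = TRUE))
)

# Same one-hot encoding, stored densely and sparsely
X_dense  <- model.matrix(y ~ . - 1, data = d)
X_sparse <- sparse.model.matrix(y ~ . - 1, data = d)

dense_bytes  <- as.numeric(object.size(X_dense))
sparse_bytes <- as.numeric(object.size(X_sparse))

cat("dense: ", round(dense_bytes / 2^20, 2), "MB\n")
cat("sparse:", round(sparse_bytes / 2^20, 2), "MB\n")
```

Each row has exactly one non-zero entry here, so the sparse matrix stores about `n` values while the dense one stores `n` times the number of levels.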

szilard • Sep 04 '20 09:09