agtboost
Add sparsity support and an agtboost matrix class that holds pointers to the C++ design matrix and response vector
This should be possible with the Eigen sparse matrix class and the R Matrix package, and possibly Rcpp modules to return a pointer to the C++ model object.
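For context on why the R Matrix package and a column-compressed C++ sparse class fit together, here is a minimal sketch of what a `dgCMatrix` stores. The toy matrix is hypothetical and only for illustration; nothing here is agtboost code. The three slots (`i`, `p`, `x`) are the compressed-column arrays that a C++ sparse matrix would need to read or map.

```r
library(Matrix)

# Hypothetical 4 x 3 matrix with four non-zero entries.
X <- sparseMatrix(i = c(1, 4, 3, 2),   # 1-based row indices
                  j = c(1, 1, 2, 3),   # 1-based column indices
                  x = c(1, 4, 3, 2),   # non-zero values
                  dims = c(4, 3))

class(X)   # compressed sparse column storage ("dgCMatrix")
X@i        # 0-based row index of each non-zero, column by column
X@p        # 0-based offsets into @i/@x where each column starts
X@x        # the non-zero values themselves
```

Only the non-zero entries and their positions are stored, which is what makes this layout attractive for one-hot encoded columns.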
Using sparse matrices (for one-hot encoded categorical variables) can lower RAM usage and speed up training, especially on larger datasets. For example, with xgboost:
train size 100K: time (sec): sparse: 17.3, dense: 18.3
train size 1M: time (sec): sparse: 38.0, dense: 86.5; RAM usage: sparse: ~1GB, dense: ~23GB
library(data.table)
library(ROCR)
library(xgboost)
library(Matrix)
set.seed(123)
# 100K training rows; use the commented line instead for the 1M run
d_train <- fread("https://github.com/szilard/benchm-ml--data/raw/master/train-0.1m.csv")
#d_train <- fread("https://github.com/szilard/benchm-ml--data/raw/master/train-1m.csv")
d_test <- fread("https://github.com/szilard/benchm-ml--data/raw/master/test.csv")
# Dense one-hot encoding; use the commented sparse.model.matrix() line instead for the sparse run
X_train_test <- model.matrix(dep_delayed_15min ~ .-1, data = rbind(d_train, d_test))
#X_train_test <- sparse.model.matrix(dep_delayed_15min ~ .-1, data = rbind(d_train, d_test))
n1 <- nrow(d_train)
n2 <- nrow(d_test)
X_train <- X_train_test[1:n1,]
X_test <- X_train_test[(n1+1):(n1+n2),]
y_train <- ifelse(d_train$dep_delayed_15min=='Y',1,0)
dxgb_train <- xgb.DMatrix(data = X_train, label = y_train)
system.time({
  md <- xgb.train(data = dxgb_train,
                  objective = "binary:logistic",
                  nrounds = 100, max_depth = 10, eta = 0.1,
                  tree_method = "hist")
})
phat <- predict(md, newdata = X_test)
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")
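The memory gap between the two encodings can be checked without downloading the benchmark data. The sketch below is only an illustration under stated assumptions: the data frame, its column names (`carrier`, `origin`) and the level counts are made up, and the sizes it prints are not the figures quoted above.

```r
library(Matrix)
set.seed(123)

# Hypothetical data: two high-cardinality categorical predictors.
n <- 10000
d <- data.frame(
  y       = rnorm(n),
  carrier = factor(sample(sprintf("c%02d", 1:50),  n, replace = TRUE)),
  origin  = factor(sample(sprintf("o%03d", 1:200), n, replace = TRUE))
)

# Same formula, dense vs. sparse one-hot encoding.
X_dense  <- model.matrix(y ~ . - 1, data = d)
X_sparse <- sparse.model.matrix(y ~ . - 1, data = d)

# In-memory size of each representation, in bytes.
dense_bytes  <- as.numeric(object.size(X_dense))
sparse_bytes <- as.numeric(object.size(X_sparse))
cat("dense MB: ", round(dense_bytes  / 2^20, 2), "\n")
cat("sparse MB:", round(sparse_bytes / 2^20, 2), "\n")
```

Each row has only one non-zero entry per factor, so the dense matrix spends almost all of its memory on zeros that the sparse matrix never stores.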