Locale ruins decimal in tuneGrid for xgbTree
I ran into a very strange problem. Training xgbTree worked in a clean session, but when I ran it a second time, a while later, it failed.
It turns out that something had changed my locale, so the eta parameter was rendered as "0,1" (my locale's format) instead of "0.1". With eta = 1, which has no decimal part, it worked. Setting LC_NUMERIC to 'C' with Sys.setlocale("LC_NUMERIC", 'C') solved it, because the dot is then used as the decimal separator.
### CODE
library(caret)
set.seed(1)
dat <- twoClassSim(100)
Sys.setlocale("LC_NUMERIC", 'pt_BR.UTF-8' )
# Tuning grid; eta is the only non-integer value
egrid <-
  expand.grid(
    nrounds = c(100, 200, 500),
    max_depth = c(4, 10),
    colsample_bytree = 1,
    eta = (1 / 10),
    gamma = 1,
    min_child_weight = 1,
    subsample = 1
  )
# 2-fold CV with class probabilities so that ROC can be computed
control <-
  trainControl(
    method = "cv",
    number = 2,
    classProbs = TRUE,
    summaryFunction = twoClassSummary,
    savePredictions = FALSE,
    preProcOptions = NULL
  )
xgbt_test <-
  train(
    Class ~ .,
    data = dat,
    metric = "ROC",
    method = "xgbTree",
    trControl = control,
    tuneGrid = egrid,
    nthread = 1
  )
Something is wrong; all the ROC metric values are missing:
ROC Sens Spec
Min. : NA Min. : NA Min. : NA
1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
Median : NA Median : NA Median : NA
Mean :NaN Mean :NaN Mean :NaN
3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
Max. : NA Max. : NA Max. : NA
NA's :6 NA's :6 NA's :6
Error: Stopping
In addition: Warning messages:
1: model fit failed for Fold1: eta=0,1, max_depth= 4, gamma=1, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=500 Error in xgb.iter.update(bst$handle, dtrain, iteration - 1, obj) :
Some trailing characters could not be parsed: ',1'
2: model fit failed for Fold1: eta=0,1, max_depth=10, gamma=1, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=500 Error in xgb.iter.update(bst$handle, dtrain, iteration - 1, obj) :
Some trailing characters could not be parsed: ',1'
3: model fit failed for Fold2: eta=0,1, max_depth= 4, gamma=1, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=500 Error in xgb.iter.update(bst$handle, dtrain, iteration - 1, obj) :
Some trailing characters could not be parsed: ',1'
4: model fit failed for Fold2: eta=0,1, max_depth=10, gamma=1, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=500 Error in xgb.iter.update(bst$handle, dtrain, iteration - 1, obj) :
Some trailing characters could not be parsed: ',1'
5: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
### FIX
Sys.setlocale("LC_NUMERIC", 'C' )
xgbt_test <-
  train(
    Class ~ .,
    data = dat,
    metric = "ROC",
    method = "xgbTree",
    trControl = control,
    tuneGrid = egrid,
    nthread = 1
  )  # no warnings now
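If changing the locale for the whole session is undesirable, the same fix can be scoped to a single call. The following is only a sketch in base R, not something provided by caret or xgboost: `with_c_numeric` is a hypothetical helper name. It sets LC_NUMERIC to "C", evaluates its argument, and restores the previous setting on exit.

```r
# Hypothetical helper (not part of caret or xgboost): evaluate `expr` with
# LC_NUMERIC set to "C", then restore whatever was set before.
with_c_numeric <- function(expr) {
  old <- Sys.getlocale("LC_NUMERIC")
  # Restore the previous numeric locale even if `expr` throws an error.
  on.exit(Sys.setlocale("LC_NUMERIC", old), add = TRUE)
  Sys.setlocale("LC_NUMERIC", "C")
  # `expr` is a lazy promise, so it is only evaluated here, under the "C" locale.
  force(expr)
}

# Usage with the objects defined above (commented out; needs caret and xgboost):
# xgbt_test <- with_c_numeric(
#   train(Class ~ ., data = dat, method = "xgbTree", metric = "ROC",
#         trControl = control, tuneGrid = egrid, nthread = 1)
# )
```

Note that R may emit a warning when the helper restores a non-"C" numeric locale on exit; that is expected and comes from `Sys.setlocale` itself.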
### Session Info:
R version 3.6.1 (2019-07-05)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /mnt/DATABASES/anaconda3/envs/giovannimc/lib/libmkl_rt.so.1
locale:
[1] LC_CTYPE=pt_BR.UTF-8 LC_NUMERIC=pt_BR.UTF-8
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=pt_BR.UTF-8 LC_NAME=pt_BR.UTF-8
[9] LC_ADDRESS=pt_BR.UTF-8 LC_TELEPHONE=pt_BR.UTF-8
[11] LC_MEASUREMENT=pt_BR.UTF-8 LC_IDENTIFICATION=pt_BR.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] pROC_1.16.2 GeneEssentiality_1.0.1.1000
[3] PRROC_1.3.1 caret_6.0-86
[5] ggplot2_3.3.2 lattice_0.20-38
loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 pillar_1.4.6 compiler_3.6.1
[4] gower_0.2.2 plyr_1.8.6 iterators_1.0.12
[7] class_7.3-15 tools_3.6.1 rpart_4.1-15
[10] ipred_0.9-9 lubridate_1.7.9 lifecycle_0.2.0
[13] tibble_3.0.3 nlme_3.1-139 gtable_0.3.0
[16] pkgconfig_2.0.3 rlang_0.4.7 Matrix_1.2-17
[19] foreach_1.5.0 prodlim_2019.11.13 e1071_1.7-3
[22] ranger_0.12.1 stringr_1.4.0 withr_2.2.0
[25] dplyr_1.0.0 generics_0.0.2 vctrs_0.3.2
[28] recipes_0.1.13 xgboost_1.1.1.1 stats4_3.6.1
[31] grid_3.6.1 nnet_7.3-12 tidyselect_1.1.0
[34] data.table_1.13.0 glue_1.4.1 R6_2.4.1
[37] survival_2.44-1.1 lava_1.6.7 reshape2_1.4.4
[40] purrr_0.3.4 magrittr_1.5 ModelMetrics_1.2.2.2
[43] scales_1.1.1 codetools_0.2-16 ellipsis_0.3.1
[46] MASS_7.3-51.3 splines_3.6.1 randomForest_4.6-14
[49] timeDate_3043.102 colorspace_1.4-1 stringi_1.4.6
[52] munsell_0.5.0 crayon_1.3.4
Sorry. That must have taken forever to figure out.
For caret, we just pass the data off to xgboost (no parsing on our side). In your first example, just before the model is fit, the data are in the proper format (stored as numeric but printed as "0,1"):
Browse[2]> tuneValue
eta max_depth gamma colsample_bytree min_child_weight subsample nrounds
1 0,1 4 1 1 1 1 500
Browse[2]> str(tuneValue)
'data.frame': 1 obs. of 7 variables:
$ eta : num 0,1
$ max_depth : num 4
$ gamma : num 1
$ colsample_bytree: num 1
$ min_child_weight: num 1
$ subsample : num 1
$ nrounds : num 500
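The difference between the stored value and its printed form can be checked before training. The sketch below uses only base R; `numeric_locale_ok` is a hypothetical helper name, and it rests on the assumption that a non-dot decimal mark reported by `Sys.localeconv()` is what produces strings such as "0,1" further down the line.

```r
# The tuning value itself is an ordinary double, whatever the locale prints.
eta <- 1 / 10
stopifnot(is.numeric(eta), isTRUE(all.equal(eta, 0.1)))

# Hypothetical pre-flight check: TRUE when the active numeric locale uses "."
# as its decimal mark, which is what the xgboost error above appears to need.
numeric_locale_ok <- function() {
  identical(unname(Sys.localeconv()["decimal_point"]), ".")
}

if (!numeric_locale_ok()) {
  warning("LC_NUMERIC is '", Sys.getlocale("LC_NUMERIC"),
          "'; consider Sys.setlocale(\"LC_NUMERIC\", \"C\") before training.")
}
```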
I hate to pass you off to someone else, but I think that this has to be fixed by xgboost.