Locale ruins decimal in tuneGrid for xgbTree

Open g1o opened this issue 3 years ago • 1 comments

I ran into a very strange problem. Training an xgbTree model worked in a clean session, but when I ran it a second time, after a while, it failed.

It turns out that something had changed my locale, so the eta parameter was read as "0,1" (my locale's format) instead of "0.1". With eta = 1, which has no decimal part, it worked. Setting the LC_NUMERIC locale to 'C' with Sys.setlocale("LC_NUMERIC", "C") solved the problem, because the dot is then used as the decimal separator.
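A quick way to check whether a session is affected is to look at the numeric locale and the decimal point it implies. The following is a minimal sketch using only base R; it resets LC_NUMERIC to "C" when something else is active, which is the workaround described above rather than a fix in caret or xgboost.

```r
# Which numeric locale is active right now?
current_numeric <- Sys.getlocale("LC_NUMERIC")

# Reset to "C" only if something else (e.g. pt_BR.UTF-8) is active.
if (!identical(current_numeric, "C")) {
  Sys.setlocale("LC_NUMERIC", "C")
}

# Decimal separator implied by the locale now in effect.
decimal_point <- Sys.localeconv()[["decimal_point"]]
print(decimal_point)
```

Running this at the top of a script, before any model fitting, keeps the check close to the place where the locale matters.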

### CODE

library(caret)
set.seed(1)
dat <- twoClassSim(100)

# Switch the numeric locale so that the comma becomes the decimal separator.
Sys.setlocale("LC_NUMERIC", "pt_BR.UTF-8")


egrid <-
  expand.grid(
    nrounds = c(100, 200, 500),
    max_depth = c(4, 10),
    colsample_bytree = 1,
    eta = 1 / 10,
    gamma = 1,
    min_child_weight = 1,
    subsample = 1
  )

control <-
  trainControl(
    method = "cv",
    number = 2,
    classProbs = TRUE,
    summaryFunction = twoClassSummary,
    savePredictions = FALSE,
    preProcOptions = NULL
  )

xgbt_test <-
  train(
    Class ~ .,
    data = dat,
    metric = "ROC",
    method = "xgbTree",
    trControl = control,
    tuneGrid = egrid,
    nthread = 1
  )

Something is wrong; all the ROC metric values are missing:
      ROC           Sens          Spec    
 Min.   : NA   Min.   : NA   Min.   : NA  
 1st Qu.: NA   1st Qu.: NA   1st Qu.: NA  
 Median : NA   Median : NA   Median : NA  
 Mean   :NaN   Mean   :NaN   Mean   :NaN  
 3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA  
 Max.   : NA   Max.   : NA   Max.   : NA  
 NA's   :6     NA's   :6     NA's   :6    
Error: Stopping
In addition: Warning messages:
1: model fit failed for Fold1: eta=0,1, max_depth= 4, gamma=1, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=500 Error in xgb.iter.update(bst$handle, dtrain, iteration - 1, obj) : 
  Some trailing characters could not be parsed: ',1'
 
2: model fit failed for Fold1: eta=0,1, max_depth=10, gamma=1, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=500 Error in xgb.iter.update(bst$handle, dtrain, iteration - 1, obj) : 
  Some trailing characters could not be parsed: ',1'
 
3: model fit failed for Fold2: eta=0,1, max_depth= 4, gamma=1, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=500 Error in xgb.iter.update(bst$handle, dtrain, iteration - 1, obj) : 
  Some trailing characters could not be parsed: ',1'
 
4: model fit failed for Fold2: eta=0,1, max_depth=10, gamma=1, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=500 Error in xgb.iter.update(bst$handle, dtrain, iteration - 1, obj) : 
  Some trailing characters could not be parsed: ',1'
 
5: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.

### FIX
Sys.setlocale("LC_NUMERIC", "C")

xgbt_test <-
  train(
    Class ~ .,
    data = dat,
    metric = "ROC",
    method = "xgbTree",
    trControl = control,
    tuneGrid = egrid,
    nthread = 1
  )  # no warnings now
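
If changing the locale for the whole session is undesirable, the same workaround can be scoped to a single call. This is a sketch under the assumption that only LC_NUMERIC matters here; `with_c_numeric()` is a hypothetical helper, not part of caret or xgboost.

```r
# Hypothetical helper: evaluate `expr` with LC_NUMERIC set to "C",
# then restore whatever numeric locale was active before.
with_c_numeric <- function(expr) {
  old_numeric <- Sys.getlocale("LC_NUMERIC")
  # R warns when LC_NUMERIC is set to a non-"C" value, so the restore is quiet.
  on.exit(suppressWarnings(Sys.setlocale("LC_NUMERIC", old_numeric)), add = TRUE)
  Sys.setlocale("LC_NUMERIC", "C")
  force(expr)  # `expr` is lazy, so it is evaluated only now, under "C"
}

# Small self-check: inside the helper the decimal point is always ".".
with_c_numeric(Sys.localeconv()[["decimal_point"]])
```

The `train()` call from the report could be passed as the argument to `with_c_numeric()` in the same way. The withr package, which appears in the session info, provides `with_locale()` for a similar purpose.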

### Session Info:
R version 3.6.1 (2019-07-05)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /mnt/DATABASES/anaconda3/envs/giovannimc/lib/libmkl_rt.so.1

locale:
 [1] LC_CTYPE=pt_BR.UTF-8          LC_NUMERIC=pt_BR.UTF-8
 [3] LC_TIME=en_GB.UTF-8           LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8       LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=pt_BR.UTF-8          LC_NAME=pt_BR.UTF-8
 [9] LC_ADDRESS=pt_BR.UTF-8        LC_TELEPHONE=pt_BR.UTF-8
[11] LC_MEASUREMENT=pt_BR.UTF-8    LC_IDENTIFICATION=pt_BR.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] pROC_1.16.2                 GeneEssentiality_1.0.1.1000
[3] PRROC_1.3.1                 caret_6.0-86
[5] ggplot2_3.3.2               lattice_0.20-38

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5           pillar_1.4.6         compiler_3.6.1
 [4] gower_0.2.2          plyr_1.8.6           iterators_1.0.12
 [7] class_7.3-15         tools_3.6.1          rpart_4.1-15
[10] ipred_0.9-9          lubridate_1.7.9      lifecycle_0.2.0
[13] tibble_3.0.3         nlme_3.1-139         gtable_0.3.0
[16] pkgconfig_2.0.3      rlang_0.4.7          Matrix_1.2-17
[19] foreach_1.5.0        prodlim_2019.11.13   e1071_1.7-3
[22] ranger_0.12.1        stringr_1.4.0        withr_2.2.0
[25] dplyr_1.0.0          generics_0.0.2       vctrs_0.3.2
[28] recipes_0.1.13       xgboost_1.1.1.1      stats4_3.6.1
[31] grid_3.6.1           nnet_7.3-12          tidyselect_1.1.0
[34] data.table_1.13.0    glue_1.4.1           R6_2.4.1
[37] survival_2.44-1.1    lava_1.6.7           reshape2_1.4.4
[40] purrr_0.3.4          magrittr_1.5         ModelMetrics_1.2.2.2
[43] scales_1.1.1         codetools_0.2-16     ellipsis_0.3.1
[46] MASS_7.3-51.3        splines_3.6.1        randomForest_4.6-14
[49] timeDate_3043.102    colorspace_1.4-1     stringi_1.4.6
[52] munsell_0.5.0        crayon_1.3.4

g1o • Sep 19 '21 14:09

Sorry. That must have taken forever to figure out.

On the caret side, we just pass the data off to xgboost (no parsing on our side). For your first example, just before the model is fit, the data are in a proper format (stored as numeric but printed as "0,1"):

Browse[2]> tuneValue
  eta max_depth gamma colsample_bytree min_child_weight subsample nrounds
1 0,1         4     1                1                1         1     500
Browse[2]> str(tuneValue)
'data.frame':	1 obs. of  7 variables:
 $ eta             : num 0,1
 $ max_depth       : num 4
 $ gamma           : num 1
 $ colsample_bytree: num 1
 $ min_child_weight: num 1
 $ subsample       : num 1
 $ nrounds         : num 500
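
The difference between how a number is stored and how it is rendered as text can be shown without touching the locale. The sketch below uses `format()` with `decimal.mark = ","` to stand in for the locale-dependent rendering; that substitution is an assumption for illustration, not a claim about how xgboost builds its parameter strings.

```r
# A one-row grid similar to the tuneValue shown in the debugger output.
tune_value <- data.frame(eta = 1 / 10, max_depth = 4, nrounds = 500)

# The value itself is an ordinary double.
is.numeric(tune_value$eta)

# Rendered with a comma as the decimal mark (stand-in for the pt_BR display).
shown <- format(tune_value$eta, decimal.mark = ",")
shown

# A parser that expects "." cannot read that text back;
# as.numeric() returns NA (with a coercion warning, suppressed here).
parsed <- suppressWarnings(as.numeric(shown))
parsed
```

The stored value is never wrong in this sketch; only its text form is, which matches the "trailing characters could not be parsed: ',1'" message in the warnings.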

I hate to pass you off to someone else, but I think that this has to be fixed by xgboost.

topepo • Sep 20 '21 01:09