ParBayesianOptimization

after running parallelization the result is not reproducible

Open lulu36-wxc opened this issue 1 year ago • 1 comments

Thank you for this amazing package! I ran the hyperparameter tuning code for xgboost, and the result is sometimes reproducible and sometimes not. Is this because I am running it in parallel?

    scoringFunction <- function(max_depth, min_child_weight, subsample) {

      dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)

      Pars <- list(
        booster = "gbtree",
        eta = 0.001,
        max_depth = max_depth,
        min_child_weight = min_child_weight,
        subsample = subsample,
        objective = "binary:logistic",
        eval_metric = "auc"
      )

      xgbcv <- xgb.cv(
        params = Pars,
        data = dtrain,
        nround = 100,
        folds = Folds,
        early_stopping_rounds = 5,
        maximize = TRUE,
        verbose = 0
      )

      return(list(
        Score = max(xgbcv$evaluation_log$test_auc_mean),
        nrounds = xgbcv$best_iteration
      ))
    }

    bounds <- list(
      max_depth = c(1L, 5L),
      min_child_weight = c(0, 25),
      subsample = c(0.25, 1)
    )

    set.seed(42)
    library(doParallel)
    cl <- makeCluster(2)
    registerDoParallel(cl)
    clusterExport(cl, c('Folds', 'agaricus.train'))
    clusterEvalQ(cl, expr = { library(xgboost) })

    set.seed(42)
    tWithPar <- system.time(
      optObj <- bayesOpt(
        FUN = scoringFunction,
        bounds = bounds,
        initPoints = 4,
        iters.n = 4,
        iters.k = 2,
        parallel = TRUE
      )
    )
    stopCluster(cl)
    registerDoSEQ()

That is the code, but getBestPars(optObj) returns something different every time I run exactly the same code. The score summary is similar to #52: the parameters chosen are the same, but the scores differ. I wonder whether this is caused by the parallelization or by something else.

I also ran the code you mentioned in #7 and the result is FALSE, although the score summary tables look the same for optobj and optobj2. So I guess the results differ between runs of the same code because of the parallelization?

lulu36-wxc, Aug 03 '23 01:08

The code snippet is not runnable (I am not sure what the object Folds is), so I cannot say for sure, but I think the problem is the set.seed() call before bayesOpt(). It is not sufficient, because it seeds only the main R session, and each worker process has its own random number generator. This is why the person in #52 used clusterSetRNGStream() rather than set.seed().

When I used the nfold parameter instead of folds in xgb.cv, the result was reproducible (with clusterSetRNGStream instead of set.seed).
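A minimal sketch of the difference between the two seeding approaches, using only the base parallel package. The helper draw_on_workers and its seed_workers argument are hypothetical names made up for this illustration; they are not part of ParBayesianOptimization or xgboost. It assumes a two-worker cluster like the one in the question.

```r
library(parallel)

# Hypothetical helper: start a fresh two-worker cluster, optionally seed
# the workers, and draw one uniform random number on each worker.
draw_on_workers <- function(seed_workers) {
  cl <- makeCluster(2)
  on.exit(stopCluster(cl))

  set.seed(42)  # seeds the main R session only, not the workers

  if (seed_workers) {
    # Gives every worker its own reproducible L'Ecuyer RNG stream.
    clusterSetRNGStream(cl, iseed = 42)
  }

  unlist(clusterEvalQ(cl, runif(1)))
}

# With clusterSetRNGStream, two separate runs draw the same numbers.
seeded_a <- draw_on_workers(seed_workers = TRUE)
seeded_b <- draw_on_workers(seed_workers = TRUE)
identical(seeded_a, seeded_b)  # TRUE

# With set.seed alone, the workers start from their own unseeded state,
# so two runs will normally differ.
unseeded_a <- draw_on_workers(seed_workers = FALSE)
unseeded_b <- draw_on_workers(seed_workers = FALSE)
identical(unseeded_a, unseeded_b)
```

In the code from the question, the equivalent change would presumably be a call such as clusterSetRNGStream(cl, iseed = 42) after registerDoParallel(cl) and before bayesOpt(). Whether that alone makes the run with folds = Folds reproducible has not been confirmed in this thread.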

Rek27, Mar 20 '24 10:03