method elm in caret gives random results with the same seed set
The elm method in caret gives different results on repeated runs with the same seed set, whereas most other methods give identical results (as expected). Versions: elmNN_1.0 and caret_6.0-58.
# load caret, DT and mlbench, then the solubility data set
require(caret); require(DT); require(mlbench);
library(AppliedPredictiveModeling)
data(solubility)
# combine predictors and outcome into a single data frame (legacy); keep only the first 20 training rows
training_data = data.frame(solTrainX,solTrainY)[1:20,]
testing_data = data.frame(solTestX,solTestY)
# rename the outcome column to 'y' to stay consistent with the style below
colnames(training_data)[colnames(training_data) == 'solTrainY'] <- 'y'
colnames(testing_data)[colnames(testing_data) == 'solTestY'] <- 'y'
# all the training data (just named x and y)
y <- training_data$y
x <- training_data[, -ncol(training_data)]
# start a parallel backend with 8 workers
library(doParallel); cl <- makeCluster(8); registerDoParallel(cl)
# RMSE and R2 results should be identical across all three runs
set.seed(123); result <- train(x,y,"elm"); getTrainPerf(result)
set.seed(123); result <- train(x,y,"elm"); getTrainPerf(result)
set.seed(123); result <- train(x,y,"elm"); getTrainPerf(result)
# stop the parallel processing and register sequential front-end
stopCluster(cl); registerDoSEQ();
This gives three different results instead of three identical ones:
> set.seed(123); result <- train(x,y,"elm"); getTrainPerf(result)
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
TrainRMSE TrainRsquared method
1 0.0949043 0.08471695 elm
> set.seed(123); result <- train(x,y,"elm"); getTrainPerf(result)
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
TrainRMSE TrainRsquared method
1 0.1702852 0.1513969 elm
> set.seed(123); result <- train(x,y,"elm"); getTrainPerf(result)
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
TrainRMSE TrainRsquared method
1 0.06912891 0.1076074 elm
The expected behavior is that all RMSE and R2 values are identical when the same seed (123) is set, as with knn:
> set.seed(123); result <- train(x,y,"knn"); getTrainPerf(result)
TrainRMSE TrainRsquared method
1 0.07321177 0.1691287 knn
> set.seed(123); result <- train(x,y,"knn"); getTrainPerf(result)
TrainRMSE TrainRsquared method
1 0.07321177 0.1691287 knn
> set.seed(123); result <- train(x,y,"knn"); getTrainPerf(result)
TrainRMSE TrainRsquared method
1 0.07321177 0.1691287 knn
One could use trainControl, but that needs further testing. Around 80 other methods in caret give the same repeatable and correct result with the seed set.
One easy way to run fully reproducible models in parallel mode with the caret package is to pass the seeds argument when calling trainControl.
See also http://stackoverflow.com/questions/13403427/fully-reproducible-parallel-models-using-caret
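The seeds approach mentioned above might look roughly like the following. This is a minimal sketch, not a tested fix for elm: `make_seeds` is a hypothetical helper (not part of caret), and the values 25 (the default number of bootstrap resamples in train) and 50 (a deliberately generous guess at the number of candidate models) are assumptions to adjust. As far as I know, caret expects a list with one integer vector per resample plus one final integer for the last model fit.

```r
# Hypothetical helper (not part of caret): build a seeds list for trainControl().
# n_resamples: number of resampling iterations (B)
# n_models:    seeds per resample; assumed to be at least the number of
#              tuning-parameter combinations that train() evaluates
make_seeds <- function(n_resamples, n_models, seed = 123) {
  set.seed(seed)
  seeds <- vector(mode = "list", length = n_resamples + 1)
  for (i in seq_len(n_resamples)) {
    # one integer seed per candidate model in this resample
    seeds[[i]] <- sample.int(1000000L, n_models)
  }
  # single seed for the final model fit
  seeds[[n_resamples + 1]] <- sample.int(1000000L, 1)
  seeds
}

# 25 bootstrap resamples; 50 is an assumed upper bound on the tuning grid size
seeds <- make_seeds(n_resamples = 25, n_models = 50)

# build the control object only if caret is installed
if (requireNamespace("caret", quietly = TRUE)) {
  ctrl <- caret::trainControl(method = "boot", number = 25, seeds = seeds)
}
```

With x and y as in the question, each run would then be `set.seed(123); result <- train(x, y, "elm", trControl = ctrl); getTrainPerf(result)`. Whether this makes elm fully repeatable has not been verified here.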