C5.0
Predictions on training data do not match summary() output
Hi all, I get a strange result when I compare the output of summary() for a model with the output of predict() on the same training set.
The same units are not classified in the same way, so I end up with two different confusion matrices. The issue arises with trials > 1 and at least 3 variables in the training set. I would expect the same results, but maybe I have misunderstood the inner workings of the algorithm.
I use R 4.1.1 and package C50 0.1.8
The following code reproduces the issue using the credit_data dataset:
##################################################################################
library(modeldata)
data(credit_data)
vars <- c("Home", "Seniority", 'Job')
# a simple split
set.seed(2411)
in_train <- sample(1:nrow(credit_data), size = 3000)
train_data_example <- credit_data[ in_train,]
test_data_example <- credit_data[-in_train,]
library(C50)
library(yardstick)
library(tibble)  # provides tibble(), used below
tree_mod <- C5.0(
  x = train_data_example[, vars],
  y = train_data_example$Status,
  trials = 10,
  control = C5.0Control(seed = 65)  # the random seed is set through C5.0Control()
)
summary(tree_mod)
prediction_df_train <- tibble(
  value = train_data_example$Status,
  predict = predict(tree_mod, newdata = train_data_example[, vars])
)
conf_mat(prediction_df_train, truth = value, estimate = predict)
# The confusion matrix printed by summary(tree_mod)
# is different from the confusion matrix built from predict().
##################################################################################
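To narrow down whether the mismatch depends on boosting, a minimal sketch like the one below could tabulate predict() against the true labels for trials = 1 and trials = 10 on the same training rows. It assumes the fitted object stores the requested and actual number of boosting iterations in `fit$trials` (named "Requested" and "Actual"); the names `train_x`, `train_y` and `confusion_by_trials` are only illustrative and are not part of the code above.

```r
library(C50)
library(modeldata)

data(credit_data)
vars <- c("Home", "Seniority", "Job")

# Same split as in the reproduction above
set.seed(2411)
in_train <- sample(seq_len(nrow(credit_data)), size = 3000)
train_x <- credit_data[in_train, vars]
train_y <- credit_data$Status[in_train]

# Fit once per trials setting and tabulate truth against predict()
# on the training rows, using base R table() only.
confusion_by_trials <- lapply(c(1, 10), function(n_trials) {
  fit <- C5.0(x = train_x, y = train_y, trials = n_trials)
  pred <- predict(fit, newdata = train_x)
  list(
    requested = n_trials,
    actual    = fit$trials[["Actual"]],  # assumption: boosting may stop early
    table     = table(truth = train_y, predicted = pred)
  )
})

# Print each table next to the number of trials actually used, to compare
# by hand with the matrix shown by summary() for the same setting.
for (res in confusion_by_trials) {
  cat("requested trials:", res$requested, " actual trials:", res$actual, "\n")
  print(res$table)
}
```

If the two tables agree with summary() for trials = 1 but not for trials = 10, that would point to the boosted prediction step rather than to the data split.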
Thank you, Massimo