scikit-lego
[BUG/HELP REQUIRED] wrong probabilities
When using predict_proba(newdata) with a BayesianGMMClassifier model, I get two column vectors, one filled with 1 and the other filled with 0. However, when using predict_proba(train_X), where train_X is the data used to fit the model, I get two column vectors correctly filled with values between 0 and 1.
hi @nipnipj! Could you provide a small example that reproduces this behaviour?
I'm using R and calling Python modules through the reticulate package.
data links:
train.csv: https://mega.nz/file/TgcHSBQY#q4yvYuc7VJEzFflCUdCP4hYyUINrhcX6UAQ7p1NDD4c
test.csv: https://mega.nz/file/qpN3nbib#naaxW5Nq99cgWZcGnTvNQ-ZfobwhVp5gImL3Wrlcn3o
library(tidyverse)
library(tidymodels)
library(reticulate)

data_raw <- readr::read_delim(file = "~/train.csv", col_names = TRUE,
                              delim = ",",
                              na = c("", " ", "NA")) %>%
  janitor::clean_names() %>%
  mutate(df = "train")

test_raw <- readr::read_delim(file = "~/test.csv", col_names = TRUE,
                              delim = ",",
                              na = c("", " ", "NA")) %>%
  janitor::clean_names() %>%
  mutate(df = "test")

factors <- c(paste0("attribute_", 0:3), "product_code")

all_data <- data_raw %>%
  select(-failure) %>%
  bind_rows(test_raw) %>%
  mutate(across(c(all_of(factors)), as.factor))

train_data <- all_data %>%
  filter(df == "train") %>%
  select(-df) %>%
  mutate(failure = as.factor(data_raw$failure))

test_data <- all_data %>%
  filter(df == "test") %>%
  select(-df)

######### PREPROCESSING
rec <- train_data %>%
  recipe(failure ~ .) %>%
  step_rm(id) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_mode(all_nominal_predictors()) %>%
  step_YeoJohnson(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  prep(strings_as_factors = FALSE)

data <- rec %>% bake(train_data) #%>% slice_sample(n=1000)

######### Model
skl <- import("sklego.mixture")

x <- data %>% select(-failure)
y <- data$failure

model <- skl$BayesianGMMClassifier(n_components = as.integer(2),
                                   covariance_type = 'full',
                                   tol = 1e-3,
                                   random_state = as.integer(2),
                                   n_init = as.integer(3),
                                   max_iter = as.integer(500),
                                   init_params = 'kmeans')$fit(x, y)

######### Predict
test <- rec %>% bake(test_data)
model$predict_proba(test)
model$predict_proba(x)
The given example relies on many functions that we don't have available, such as step_YeoJohnson. Could you instead share a minimal example of the same behavior in Python? This is a Python project first and foremost, so an example in that language will make it easier for us to check what is happening.
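For anyone putting together such a Python example, one cause worth ruling out is test data that lands far from the training distribution, for instance if the test columns end up scaled or ordered differently from the training columns. The sketch below is not sklego's implementation and does not reproduce the report. It assumes hypothetical 1-D Gaussian class densities (CLASS_PARAMS is made up for illustration) and plain likelihood normalization, and only shows that normalized class likelihoods can saturate to exactly 0 and 1 in floating point for far-away inputs.

```python
import math


def gaussian_logpdf(x, mean, std):
    """Log density of a 1-D normal distribution."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def class_probabilities(x, class_params):
    """Normalize per-class log-likelihoods into probabilities (a softmax)."""
    logps = [gaussian_logpdf(x, mean, std) for mean, std in class_params]
    top = max(logps)  # subtract the maximum for numerical stability
    weights = [math.exp(lp - top) for lp in logps]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical class densities: class 0 ~ N(0, 1), class 1 ~ N(1, 1).
CLASS_PARAMS = [(0.0, 1.0), (1.0, 1.0)]

near = class_probabilities(0.4, CLASS_PARAMS)     # close to both class means
far = class_probabilities(1000.0, CLASS_PARAMS)   # far outside the "training" range

print("near:", near)  # both entries strictly between 0 and 1
print("far: ", far)   # one entry underflows to 0.0, the other becomes 1.0
```

If the real test matrix behaves like the far case, comparing column names, column order, and summary statistics of the baked train and test data would be a reasonable first check before suspecting the classifier itself.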
Closing due to radio silence.