
[BUG/HELP REQUIRED] wrong probabilities

Open nipnipj opened this issue 2 years ago • 3 comments

When using `predict_proba(newdata)` with a `BayesianGMMClassifier` model, I get two column vectors, one filled entirely with 1s and the other entirely with 0s. However, when using `predict_proba(train_X)`, where `train_X` is the data used to fit the model, I get two column vectors correctly filled with values between 0 and 1.

nipnipj avatar Aug 03 '22 18:08 nipnipj

Hi @nipnipj! Could you provide a small example that reproduces this behaviour?

MBrouns avatar Aug 04 '22 08:08 MBrouns

I'm using R and calling Python modules via the reticulate package.

data links:

train.csv: https://mega.nz/file/TgcHSBQY#q4yvYuc7VJEzFflCUdCP4hYyUINrhcX6UAQ7p1NDD4c
test.csv: https://mega.nz/file/qpN3nbib#naaxW5Nq99cgWZcGnTvNQ-ZfobwhVp5gImL3Wrlcn3o
```r
library(tidyverse)
library(tidymodels)
library(reticulate)

data_raw <- readr::read_delim(file = "~/train.csv", col_names = TRUE,
                              delim = ",",
                              na = c("", " ", "NA")) %>%
  janitor::clean_names() %>%
  mutate(df = "train")

test_raw <- readr::read_delim(file = "~/test.csv", col_names = TRUE, delim = ",",
                              na = c("", " ", "NA")) %>%
  janitor::clean_names() %>%
  mutate(df = "test")

factors <- c(paste0("attribute_", 0:3), "product_code")

all_data <- data_raw %>%
  select(-failure) %>%
  bind_rows(test_raw) %>%
  mutate(across(all_of(factors), as.factor))

train_data <- all_data %>%
  filter(df == "train") %>%
  select(-df) %>%
  mutate(failure = as.factor(data_raw$failure))

test_data <- all_data %>%
  filter(df == "test") %>%
  select(-df)

######### Preprocessing
rec <- train_data %>%
  recipe(failure ~ .) %>%
  step_rm(id) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_mode(all_nominal_predictors()) %>%
  step_YeoJohnson(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  prep(strings_as_factors = FALSE)

data <- rec %>% bake(train_data) # %>% slice_sample(n = 1000)

######### Model
skl <- import("sklego.mixture")

x <- data %>% select(-failure)
y <- data$failure

model <- skl$BayesianGMMClassifier(n_components = as.integer(2),
                                   covariance_type = "full",
                                   tol = 1e-3,
                                   random_state = as.integer(2),
                                   n_init = as.integer(3),
                                   max_iter = as.integer(500),
                                   init_params = "kmeans")$fit(x, y)

######### Predict
test <- rec %>% bake(test_data)
model$predict_proba(test)
model$predict_proba(x)
```

nipnipj avatar Aug 04 '22 14:08 nipnipj

The given example uses many R functions that aren't available to us, such as step_YeoJohnson. Could you instead share a minimal example of the same behavior in Python? This is a Python project first and foremost, so an example in that language will make it easier for us to check what is happening.

koaning avatar Aug 08 '22 15:08 koaning

Closing due to radio silence.

koaning avatar Aug 30 '22 07:08 koaning
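For context on the symptom itself: probabilities that saturate to exactly 0 and 1 are what any GMM-based classifier produces when the inputs to `predict_proba` lie far outside the distribution the model was fitted on, e.g. when the training-time scaling or column ordering is not applied to the new data (an easy mistake when passing data frames through reticulate). Below is a minimal sketch of that mechanism, assuming plain scikit-learn rather than sklego's actual internals; the names `X0`, `X1`, and `predict_proba` are illustrative, not part of either library's API.

```python
import numpy as np
from scipy.special import softmax
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Two classes of standardized training data (roughly mean 0/1, sd 1),
# mimicking the output of step_normalize().
X0 = rng.normal(0.0, 1.0, size=(200, 3))
X1 = rng.normal(1.0, 1.0, size=(200, 3))

# One mixture per class, as a GMM classifier does under the hood.
gmm0 = GaussianMixture(n_components=2, random_state=0).fit(X0)
gmm1 = GaussianMixture(n_components=2, random_state=0).fit(X1)

def predict_proba(X):
    # Per-class log-likelihoods turned into probabilities via softmax.
    ll = np.column_stack([gmm0.score_samples(X), gmm1.score_samples(X)])
    return softmax(ll, axis=1)

# In-distribution data: probabilities strictly between 0 and 1.
p_train = predict_proba(np.vstack([X0, X1]))

# "New" data on a very different scale (e.g. normalization not applied):
# the log-likelihood gap between classes explodes and softmax saturates,
# so every row comes out as (0, 1) or (1, 0).
p_test = predict_proba(rng.normal(50.0, 10.0, size=(100, 3)))
```

If this is the cause, comparing `summary()` of the baked train and test matrices (same columns, same order, same scale) should reveal the mismatch.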