
Does FSelectorRcpp produce the same results as FSelector?

Open larskotthoff opened this issue 8 years ago • 16 comments

Do you guys have any tests to check this? We're thinking of replacing FSelector with FSelectorRcpp in mlr, but we'd like to be sure that we remain reproducible.

@berndbischl

larskotthoff avatar Mar 15 '17 08:03 larskotthoff

For some functions - yes - see e.g. https://github.com/mi2-warsaw/FSelectorRcpp/blob/master/tests/testthat/test-information_gain.R

This is part of the code:

test_that("Comparsion with FSelector", {
  expect_equal(information.gain(Species ~ ., data = iris)$attr_importance,
               information_gain(formula = Species ~ ., data = iris)$importance)

  expect_equal(gain.ratio(Species ~ ., data = iris)$attr_importance,
               information_gain(formula = Species ~ ., data = iris,
                                type = "gainratio")$importance)

  expect_equal(symmetrical.uncertainty(Species ~ .,
                                       data = iris)$attr_importance,
               information_gain(formula = Species ~ ., data = iris,
                                type = "symuncert")$importance)
})
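
For readers comparing the outputs by hand, the three measures checked in this test can be written in terms of entropies. The sketch below is a base-R illustration for columns that are already discrete. It is not the implementation used by either package, the helper names `entropy` and `info_measures` are made up for this example, and the natural logarithm is an assumption.

```r
# Shannon entropy (natural log, an assumption) of one discrete variable,
# or the joint entropy when several variables are passed.
entropy <- function(...) {
  counts <- table(...)
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log(p))
}

# Information gain, gain ratio and symmetrical uncertainty of a discrete
# attribute with respect to a discrete class (hypothetical helper).
info_measures <- function(attribute, class) {
  h_attr <- entropy(attribute)
  h_class <- entropy(class)
  ig <- h_class + h_attr - entropy(class, attribute)
  c(infogain = ig,
    gainratio = ig / h_attr,
    symuncert = 2 * ig / (h_attr + h_class))
}

# Toy usage: an attribute that partly predicts the class.
toy_attr <- c("x", "x", "x", "y", "y", "y")
toy_class <- c("p", "p", "q", "q", "q", "q")
print(info_measures(toy_attr, toy_class))
```

Numeric columns would first need to be discretized, which this sketch deliberately leaves out.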

For other functions please send us a list of functionalities which must be checked against FSelector, and then we will prepare required tests to convince you that everything is fine:)

zzawadz avatar Mar 15 '17 09:03 zzawadz

I'd love to see tests for all of the functions that users can call, ideally on a range of different inputs. Maybe using quickcheck (https://github.com/RevolutionAnalytics/quickcheck)?
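
One lightweight way to cover "a range of different inputs" without extra dependencies is to rerun both implementations on random row subsamples and record the largest disagreement. The helper below is a plain base-R stand-in for that idea, not quickcheck itself. `max_score_gap` and its arguments are hypothetical names, and the package calls are only attempted when both packages are installed.

```r
# Hypothetical helper: evaluate two scoring functions on the same random row
# subsamples of `data` and return the largest absolute difference seen.
max_score_gap <- function(ref_fun, new_fun, data,
                          n_rep = 20, frac = 0.8, seed = 1) {
  set.seed(seed)
  gaps <- vapply(seq_len(n_rep), function(i) {
    rows <- sample(nrow(data), size = ceiling(frac * nrow(data)))
    sub <- data[rows, , drop = FALSE]
    max(abs(ref_fun(sub) - new_fun(sub)))
  }, numeric(1))
  max(gaps)
}

# Usage sketch: only runs when both packages can be loaded.
if (requireNamespace("FSelector", quietly = TRUE) &&
    requireNamespace("FSelectorRcpp", quietly = TRUE)) {
  gap <- max_score_gap(
    ref_fun = function(d) {
      FSelector::information.gain(Species ~ ., data = d)$attr_importance
    },
    new_fun = function(d) {
      FSelectorRcpp::information_gain(formula = Species ~ ., data = d)$importance
    },
    data = iris
  )
  print(gap)
}
```

A gap close to zero across many subsamples would be evidence of agreement on that dataset only; other datasets and column types would need their own runs.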

larskotthoff avatar Mar 15 '17 09:03 larskotthoff

Oh and once I'm convinced I'm willing to officially deprecate FSelector in favour of FSelectorRcpp.

larskotthoff avatar Mar 15 '17 09:03 larskotthoff

Ok. We will work on this.

Thanks!

zzawadz avatar Mar 15 '17 09:03 zzawadz

@zzawadz another amazing challenge for FSelectorRcpp : )

Maybe the easiest way would be to include FSelectorRcpp in FSelector itself.

MarcinKosinski avatar Mar 15 '17 09:03 MarcinKosinski

@MarcinKosinski Good idea. We can replace the inner implementation of FSelector's functions step by step until the two packages converge. @larskotthoff What do you think?

zzawadz avatar Mar 15 '17 09:03 zzawadz

Sounds good. Pull requests welcome!

larskotthoff avatar Mar 15 '17 10:03 larskotthoff

So this can be closed - https://github.com/mi2-warsaw/FSelectorRcpp/issues/27 : ) @larskotthoff is aware that we will propose changes to the inner implementation.

MarcinKosinski avatar Mar 15 '17 11:03 MarcinKosinski

Getting back to this thread. FSelectorRcpp will be available on CRAN again soon (it was removed because of missing information about its C++ dependency): https://github.com/mi2-warsaw/FSelectorRcpp/issues/69

To let FSelectorRcpp become part of the FSelector engine, I think we could try substituting the FSelector:::information.gain.body() function with FSelectorRcpp::information_gain(). We need to polish FSelectorRcpp so that it produces the same results as FSelector, and also enable some other approaches to handling NAs and to discretizing the dependent variable.

Two tasks should be finished for that:

  • [ ] https://github.com/mi2-warsaw/FSelectorRcpp/issues/62 Enable dependent variable discretization the same as FSelector:::equal.frequency.binning.discretization. FSelectorRcpp does not provide discretization for the dependent variable. To make it compatible with FSelector we will add an extra option to discretize the dependent variable:

    # Sketch of the proposed substitution inside FSelector (illustrative, not runnable as written)
    FSelector:::information.gain.body <- function(params, equal = TRUE) {
        FSelectorRcpp::information_gain(params, equal = equal)
    }

  • [ ] https://github.com/mi2-warsaw/FSelectorRcpp/issues/63 Enable FSelectorRcpp to handle NAs in explanatory variables as RWeka::Discretize does. We need to reorganize the code slightly, so that we only remove rows that have NAs in the dependent variable (and not rows with NAs in any variable considered for discretization, as was done before), and so that we can provide exactly the same explanatory variable discretization as RWeka::Discretize.

MarcinKosinski avatar Aug 28 '17 09:08 MarcinKosinski

Hi, I am struggling to get the same results from FSelectorRcpp and FSelector - posted under this issue: https://github.com/mlr-org/mlr/issues/1677#issuecomment-431234791. The results I get are actually very different, and the impact on an end model is large. Would appreciate your help if I am doing anything wrong. Thanks!

RandomGuessR avatar Oct 19 '18 04:10 RandomGuessR

@RandomGuessR

FSelectorRcpp treats integer columns like factors, not numerics, and therefore does not discretize them before calculating the information gain. You need to cast the integer columns to numeric to get the same result.

See the code below:

library(FSelectorRcpp)
library(FSelector)
dt <- read.csv("~/Downloads/all/train.csv")

dt2 <- data.frame(
  yy = dt$target,
  X0deb4b6a8 = dt$X0deb4b6a8,
  X0deb4b6a8Numeric = as.numeric(dt$X0deb4b6a8)
)

information_gain(yy ~ ., dt2, equal = TRUE)
#          attributes  importance
# 1        X0deb4b6a8 0.001443917
# 2 X0deb4b6a8Numeric 0.000000000


information.gain(yy ~ ., dt2)
#                   attr_importance
# X0deb4b6a8                      0
# X0deb4b6a8Numeric               0
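
The same cast can be applied to a whole data frame before calling either package. A minimal base-R sketch, with `integers_to_numeric` as a made-up helper name and `toy` as stand-in data:

```r
# Convert every integer column of a data frame to double; factors and all
# other column types are returned unchanged.
integers_to_numeric <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.integer(col) && !is.factor(col)) as.numeric(col) else col
  })
  df
}

# Stand-in data: one integer, one double, one factor column.
toy <- data.frame(
  id = 1:3,
  score = c(0.5, 1.5, 2.5),
  label = factor(c("a", "b", "c"))
)
str(integers_to_numeric(toy))
```

Factors are skipped on purpose, since converting them would turn level codes into numbers and change their meaning.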

zzawadz avatar Oct 19 '18 08:10 zzawadz

Thanks for helping with this so quickly! Might be good to document this difference somewhere in the package(s)

RandomGuessR avatar Oct 19 '18 08:10 RandomGuessR

Kudos for Zzawadz

MarcinKosinski avatar Oct 19 '18 10:10 MarcinKosinski

@RandomGuessR @MarcinKosinski

I found an inconsistent behavior in FSelectorRcpp :( information_gain does not discretize integers, but discretize does :( I consider this a bug, and I'll fix it.

zzawadz avatar Oct 19 '18 10:10 zzawadz

Thanks @zzawadz.

After changing the data from integer to numeric, FSelectorRcpp works like a treat; really happy with the performance. The RWeka-based implementation was too slow for most real-world practical purposes.

RandomGuessR avatar Oct 19 '18 11:10 RandomGuessR

We (I should say I) decided that FSelectorRcpp will try to mimic the behavior of FSelector, so from 0.3.0 onward integers will be treated as numerics by default, not as factors.

zzawadz avatar Nov 10 '18 16:11 zzawadz