FSelectorRcpp
FSelectorRcpp copied to clipboard
Does FSelectorRcpp produce the same results as FSelector?
Do you guys have any tests to check this? We're thinking of replacing FSelector with FSelectorRcpp in mlr, but we'd like to be sure that we remain reproducible.
@berndbischl
For some functions - yes - see e.g. https://github.com/mi2-warsaw/FSelectorRcpp/blob/master/tests/testthat/test-information_gain.R
This is part of the code:
test_that("Comparsion with FSelector", {
expect_equal(information.gain(Species ~ ., data = iris)$attr_importance,
information_gain(formula = Species ~ ., data = iris)$importance)
expect_equal(gain.ratio(Species ~ ., data = iris)$attr_importance,
information_gain(formula = Species ~ ., data = iris,
type = "gainratio")$importance)
expect_equal(symmetrical.uncertainty(Species ~ .,
data = iris)$attr_importance,
information_gain(formula = Species ~ ., data = iris,
type = "symuncert")$importance)
})
For other functions please send us a list of functionalities which must be checked against FSelector, and then we will prepare required tests to convince you that everything is fine:)
I'd love to see tests for all of the functions that users can call, ideally on a range of different inputs. Maybe using quickcheck (https://github.com/RevolutionAnalytics/quickcheck)?
Oh and once I'm convinced I'm willing to officially deprecate FSelector in favour of FSelectorRcpp.
Ok. We will work on this.
Thanks!
@zzawadz another amazing challenge for FSelectorRcpp : )
Maybe it'll be the easiest way to include FSelectorRcpp in the FSelector
@MarcinKosinski Good idea. We can replace functionalities (inner implementation) in FSelector step by step to reach the convergence. @larskotthoff What do you think?
Sounds good. Pull requests welcome!
So this can be closed - https://github.com/mi2-warsaw/FSelectorRcpp/issues/27 : ) @larskotthoff is aware of that we will suggest inner implementation
Getting back to this thread. FSelectorRcpp will be available on CRAN again soon (removed because lack of informtion of C++ dependency) https://github.com/mi2-warsaw/FSelectorRcpp/issues/69
To enable FSelectorRcpp be a part of FSelector engine I think we could try
substituting
FSelector:::information.gain.body() function with the FSelectorRcpp::information_gain(). We need to polish FSelectorRcpp edition to produce the same results as FSelector and also enable some another approaches to dealing with NAs and discretization of dependent variable.
2 tasks should be finished then
- [ ] https://github.com/mi2-warsaw/FSelectorRcpp/issues/62
Enable dependent variable discretization the same as FSelector:::equal.frequency.binning.discretization-FSelectorRcppdoes not provide discretization for the dependent variable. To make it suitable withFSelectorwe will enable extra option to discretize the dependent variable (
FSelector:::information.gain.body <- function(params, equal = TRUE) {
FSelectorRcpp::information_gain(params, equal = equal)
}
- [ ] https://github.com/mi2-warsaw/FSelectorRcpp/issues/63
Enable FSelectoRcpp dealing with NAs in explanatory variables as in the RWeka::Discretizewe slightly need to reorganize the code, so that we only remove rows that haveNAs in the dependent variable (and not in any variable considered to be discretized as it was done before) and that we can provide the exact same explanatory variable discretization as inRWeka::Discretize
Hi, I am struggling to get the same results from FSelectorRcpp and FSelector - posted under this issue: https://github.com/mlr-org/mlr/issues/1677#issuecomment-431234791. The results I get are actually very different, and the impact on an end model is large. Would appreciate your help if I am doing anything wrong. Thanks!
@RandomGuessR
FSelectorRcpp treats integer columns like factors, not numeric, and because of that, it does not discretize them before calculating the information gain. You need to cast the integers columns into numerics to get the same result:
See the code below:
library(FSelectorRcpp)
library(FSelector)
dt <- read.csv("~/Downloads/all/train.csv")
dt2 <- data.frame(
yy = dt$target,
X0deb4b6a8 = dt$X0deb4b6a8,
X0deb4b6a8Numeric = as.numeric(dt$X0deb4b6a8)
)
information_gain(yy ~ ., dt2, equal = TRUE)
# attributes importance
# 1 X0deb4b6a8 0.001443917
# 2 X0deb4b6a8Numeric 0.000000000
information.gain(yy ~ ., dt2)
# attr_importance
# X0deb4b6a8 0
# X0deb4b6a8Numeric 0
Thanks for helping with this so quickly! Might be good to document this difference somewhere in the package(s)
Kudos for Zzawadz
pt., 19 paź 2018, 10:34 użytkownik RandomGuessR [email protected] napisał:
Thanks for helping with this so quickly!
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/mi2-warsaw/FSelectorRcpp/issues/51#issuecomment-431288122, or mute the thread https://github.com/notifications/unsubscribe-auth/AGdazkRXSO22dOfoPnz52ZiKuTYusGvSks5umY6kgaJpZM4MdpFY .
@RandomGuessR @MarcinKosinski
I found an inconsistent behavior in FSelectorRcpp:( The information_gain does not discretize integers, but discretize do this:( I consider this as a bug, and I'll fix this.
Thanks @zzawadz.
After changing the data from integer to numeric, FSelectorRcpp works like a treat; really happy with the performance. The RWeka-based implementation was too slow for most real-world practical purposes.