OneR
monotonic predictors
In some applications the predictors are monotonic and it does not make sense to have more than one split on a predictor. Suggest having an option to limit trees to a single split.
Thank you for your suggestion. I am not sure I understand it correctly. Could you provide a concrete example? Thank you again.
The fitted rules in the following example should be increasing in Time, but OneR tries to fit the 0 at Time = 5.
library(OneR)
BOD[5, 2] <- 0
OneR(demand ~ Time, BOD)
## Call:
## OneR.formula(formula = demand ~ Time, data = BOD)
##
## Rules:
## If Time = (0.994,2.2] then demand = (7.92,11.9]
## If Time = (2.2,3.4] then demand = (15.8,19.8]
## If Time = (3.4,4.6] then demand = (15.8,19.8]
## If Time = (4.6,5.8] then demand = (-0.0198,3.96] <----------------------------
## If Time = (5.8,7.01] then demand = (15.8,19.8]
##
## Accuracy:
## 6 of 6 instances classified correctly (100%)
Although there are examples of non-monotonic dose-response curves, a higher dose usually leads to a higher response.
Another example is valuation. Suppose we want to predict the price of a house based on the number of bedrooms and other predictors. More bedrooms should lead to a higher valuation.
Usually it is sufficient to guarantee monotonicity without specifying the direction; if the direction is the opposite of what we expected, we can re-examine our assumptions or reject that predictor. A simple way to ensure monotonicity is to allow only one split on each predictor, which I assume would be easy to implement.
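To make the direction-free notion of monotonicity concrete, here is a minimal base R sketch. The helper `is_monotone()` is hypothetical and not part of OneR; the vector `pred_lower` simply copies the lower bounds of the predicted demand intervals from the rules printed above, ordered by Time bin.

```r
# Hypothetical helper (not part of OneR): TRUE if v never changes direction,
# i.e. it is entirely non-decreasing or entirely non-increasing.
is_monotone <- function(v) all(diff(v) >= 0) || all(diff(v) <= 0)

# Lower bounds of the predicted demand intervals, taken from the five rules
# printed above, in order of increasing Time bin.
pred_lower <- c(7.92, 15.8, 15.8, -0.0198, 15.8)

five_bins <- is_monotone(pred_lower)        # rises, then falls, then rises
two_bins  <- is_monotone(c(15.8, -0.0198))  # any two-value sequence passes
```

With a single split there are only two predicted values, so the check cannot fail in either direction, which is why one split per predictor is enough to guarantee monotonicity.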
OK, if I understand you correctly, your question is about the number of splits of the predictors. This should be quite easy to achieve by using the bin() function with the nbins argument, specifying the number of bins before you use the OneR() function on the resulting data frame (please also consult the documentation for bin() for an example). So, if you only want one split per predictor, you should set nbins = 2.
Please try this and come back to tell me if it solved your problem. Thank you
That splits each column into two, but it doesn't seem to do so optimally. On the previous example it gets 2 predictions wrong, although it would be possible to get 0 wrong if the splits were different.
BOD[5, 2] <- 0
BOD2 <- bin(BOD, 2)
OneR(demand ~ Time, BOD2)
##
## Call:
## OneR.formula(formula = demand ~ Time, data = BOD2)
##
## Rules:
## If Time = (0.994,4] then demand = (9.9,19.8]
## If Time = (4,7.01] then demand = (-0.0198,9.9]
##
## Accuracy:
## 4 of 6 instances classified correctly (66.67%) <-------------------------------
Just a follow-up to the last post. Computing all possible increasing splits into two groups, we find that the split Time > 5, demand > 19 gives 0 wrong predictions.
BOD[5,2] <- 0
# Count the mismatches between the split Time > Time[i] and the split demand > demand[j]
noWrong <- function(i, j) {
  x <- BOD$Time > BOD$Time[i]
  y <- BOD$demand > BOD$demand[j]
  # Report any pair of cut points on which the two splits agree everywhere
  if (all(x == y)) cat("i=", i, "j=", j,
    "Time >", BOD$Time[i], "demand >", BOD$demand[j],
    "x=y=", paste(+x, collapse = ""), "\n")
  sum(x != y)
}
BOD
##   Time demand
## 1    1    8.3
## 2    2   10.3
## 3    3   19.0
## 4    4   16.0
## 5    5    0.0
## 6    7   19.8
outer(1:5, 1:5, Vectorize(noWrong))
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    4    3    2
## [2,]    2    1    3    2    3
## [3,]    3    2    2    3    4
## [4,]    4    3    1    2    5
## [5,]    3    2    0    1    4
noWrong(5, 3)
## i= 5 j= 3 Time > 5 demand > 19 x=y= 000001
## [1] 0
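The same exhaustive search can be wrapped into a reusable function. The following is only a sketch in base R: `best_single_split()` is a hypothetical helper, not something OneR provides, and like the search above it considers increasing splits only (x above its cut paired with y above its cut).

```r
# Hypothetical helper (not part of OneR): try every pair of cut points and
# return the pair with the fewest disagreements between x > x_cut and y > y_cut.
best_single_split <- function(x, y) {
  # Candidate cuts are the observed values except the maximum, because
  # "greater than the maximum" would leave the upper group empty.
  x_cuts <- head(sort(unique(x)), -1)
  y_cuts <- head(sort(unique(y)), -1)

  grid <- expand.grid(x_cut = x_cuts, y_cut = y_cuts)
  grid$wrong <- mapply(
    function(xc, yc) sum((x > xc) != (y > yc)),
    grid$x_cut, grid$y_cut
  )
  # First row with the minimum number of mismatches
  grid[which.min(grid$wrong), ]
}

bod <- datasets::BOD   # work on a copy so the built-in data set is untouched
bod[5, 2] <- 0
res <- best_single_split(bod$Time, bod$demand)
res
```

On the modified BOD data, `res` contains the cut points Time > 5 and demand > 19 with 0 mismatches, in agreement with `noWrong(5, 3)`. A decreasing relationship could be handled by running the same search on `-y`, and ties are resolved by taking the first minimum, which is an arbitrary choice.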