monotonic predictors

Open ggrothendieck opened this issue 1 year ago • 5 comments

In some applications the response is known to be monotonic in a predictor, so it does not make sense to have more than one split on that predictor. I suggest adding an option that limits the rules to a single split per predictor.

ggrothendieck avatar Mar 10 '23 14:03 ggrothendieck

Thank you for your suggestion. I am not quite sure whether I understand it correctly. Could you provide a concrete example? Thank you again

vonjd avatar Mar 10 '23 15:03 vonjd

In the following example demand should be increasing in Time, but the fitted rules try to accommodate the 0 (see the arrow).

library(OneR)
BOD[5, 2] <- 0
OneR(demand ~ Time, BOD)

## Call:
## OneR.formula(formula = demand ~ Time, data = BOD)
##
## Rules:
## If Time = (0.994,2.2] then demand = (7.92,11.9]
## If Time = (2.2,3.4]   then demand = (15.8,19.8]
## If Time = (3.4,4.6]   then demand = (15.8,19.8]
## If Time = (4.6,5.8]   then demand = (-0.0198,3.96] <----------------------------
## If Time = (5.8,7.01]  then demand = (15.8,19.8]
##
## Accuracy:
## 6 of 6 instances classified correctly (100%)

Although there are examples of non-monotonic dose-response curves, a higher dose usually leads to a higher response.

Another example is valuation. Suppose we want to predict the price of a house based on the number of bedrooms and other predictors. More bedrooms should lead to a higher valuation.

Usually it is sufficient to guarantee monotonicity without specifying the direction. If the direction turns out to be the opposite of what was expected, we can re-examine our assumptions or reject that predictor. A simple way to ensure monotonicity is to allow only one split on each predictor, which I assume would be easy to implement.
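
To make the requirement concrete, here is a minimal base-R sketch of a monotonicity check on binned rules. It assumes that five equal-width bins from cut() correspond to the intervals in the output above; is_monotone is a hypothetical helper, not part of OneR.

```r
# Hypothetical helper (not part of OneR): TRUE if the values never change direction.
is_monotone <- function(v) {
  d <- diff(v)
  all(d >= 0) || all(d <= 0)
}

bod <- BOD          # built-in dataset; work on a copy
bod[5, 2] <- 0      # same modification as in the example above

# Five equal-width bins per variable, as integer bin codes (1 = lowest bin).
time_bin   <- cut(bod$Time,   breaks = 5, labels = FALSE)
demand_bin <- cut(bod$demand, breaks = 5, labels = FALSE)

# Most frequent demand bin within each Time bin (ties go to the lowest bin).
rule <- as.vector(tapply(demand_bin, time_bin,
                         function(b) as.integer(names(which.max(table(b))))))

rule               # predicted demand bin for each Time bin, in Time order
is_monotone(rule)  # FALSE here: the predictions go up, down, then up again
```

With a single split per predictor there are only two predicted values, so a check like this can never fail.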

ggrothendieck avatar Mar 11 '23 19:03 ggrothendieck

OK, if I understand you correctly, your question is about the number of splits per predictor. This should be quite easy to achieve with the bin() function: use its nbins argument to specify the number of bins, then call OneR() on the resulting data frame (please also consult the documentation of bin() for an example). So, if you want only one split per predictor, set nbins = 2.

Please try this and come back to tell me if it solved your problem. Thank you

vonjd avatar Mar 11 '23 21:03 vonjd

That splits each column into two, but it doesn't seem to do it optimally. On the previous example it gets 2 predictions wrong, although it would be possible to get 0 wrong if the splits were placed differently.

BOD[5, 2] <- 0
BOD2 <- bin(BOD, 2)
OneR(demand ~ Time, BOD2)
##
## Call:
## OneR.formula(formula = demand ~ Time, data = BOD2)
##
## Rules:
## If Time = (0.994,4] then demand = (9.9,19.8]
## If Time = (4,7.01]  then demand = (-0.0198,9.9]
##
## Accuracy:
## 4 of 6 instances classified correctly (66.67%)  <-------------------------------
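
A possible explanation, sketched in base R rather than with OneR itself: if the two bins are equal-width, the cut points sit at the midpoint of each column's range and never look at the other column. The sketch assumes that cut() with breaks = 2 mirrors what bin(BOD, 2) does here, which the interval labels above suggest but which I have not confirmed in the OneR source.

```r
bod <- BOD          # built-in dataset; work on a copy
bod[5, 2] <- 0

# Two equal-width bins per column: the cut points fall at the midpoint of each
# column's range (Time = 4, demand = 9.9), independently of the other column.
time_bin   <- cut(bod$Time,   breaks = 2)
demand_bin <- cut(bod$demand, breaks = 2)

# Cross-tabulate; predicting the majority demand bin in each Time bin
# classifies max(row) instances of that row correctly.
tab <- table(time_bin, demand_bin)
tab
correct <- sum(apply(tab, 1, max))
correct             # number of instances classified correctly
```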

ggrothendieck avatar Mar 12 '23 17:03 ggrothendieck

Just a follow-up to the last post. If we compute all possible increasing splits into two bins, we find that splitting at Time > 5 together with demand > 19 gives 0 wrong predictions.

BOD[5,2] <- 0

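# Number of rows where the split Time > Time[i] disagrees with the split
# demand > demand[j]; prints the thresholds when the two splits agree everywhere.
noWrong <- function(i, j) {
  x <- BOD$Time > BOD$Time[i]
  y <- BOD$demand > BOD$demand[j]
  if (all(x == y)) cat("i=", i, "j=", j,
    "Time >", BOD$Time[i], "demand >", BOD$demand[j],
    "x=y=", paste(+x, collapse = ""), "\n")
  sum(x != y)
}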
 
BOD
##   Time demand
##  1    1    8.3
##  2    2   10.3
##  3    3   19.0
##  4    4   16.0
##  5    5    0.0
##  6    7   19.8

outer(1:5, 1:5, Vectorize(noWrong))
##       [,1] [,2] [,3] [,4] [,5]
##  [1,]    1    2    4    3    2
##  [2,]    2    1    3    2    3
##  [3,]    3    2    2    3    4
##  [4,]    4    3    1    2    5
##  [5,]    3    2    0    1    4

noWrong(5, 3)
## i= 5 j= 3 Time > 5 demand > 19 x=y= 000001 
## [1] 0
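
The same search could be wrapped in a reusable function. This is only a sketch: best_single_split is a hypothetical name, it is not part of OneR, and like the code above it only considers increasing splits (x above its threshold paired with y above its threshold).

```r
# Hypothetical helper (not part of OneR): exhaustive search for the pair of
# thresholds (x_cut, y_cut) that minimises disagreements between
# x > x_cut and y > y_cut.
best_single_split <- function(x, y) {
  # Candidate thresholds: every observed value except the maximum,
  # so that both sides of each split are non-empty.
  xs <- head(sort(unique(x)), -1)
  ys <- head(sort(unique(y)), -1)
  grid <- expand.grid(x_cut = xs, y_cut = ys)
  grid$wrong <- mapply(function(a, b) sum((x > a) != (y > b)),
                       grid$x_cut, grid$y_cut)
  grid[which.min(grid$wrong), ]   # first row with the fewest disagreements
}

bod <- BOD          # built-in dataset; work on a copy
bod[5, 2] <- 0
best <- best_single_split(bod$Time, bod$demand)
best                # thresholds and number of wrong predictions
```

To allow decreasing relationships as well, one could run the same search on -y and keep whichever direction gives fewer wrong predictions.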

ggrothendieck avatar Mar 13 '23 09:03 ggrothendieck