"ROCR currently supports only evaluation of binary classification tasks" in PredictionList() with random forest model classifier

Open xameno opened this issue 5 years ago • 0 comments

Hi everybody,

I've traced an issue that occurs when calling the function PredictionList on a random forest model of classification type.

First of all, I think the correct way to represent the absence--presence of a land type in a pixel is as a factor response variable, for the glm, rpart and random forest predictive models alike. In the pie example of the package, however, the land use type is passed to the predictive models as numeric. The three predictive models' respective packages specify that a 0--1 absence--presence response like this must be given as a factor. For example, if you specify the presence of a land type as a factor, the random forest model is fitted as a classification model and also returns a 2 x 2 confusion matrix. This is not the case in the pie example documented in the package, where a regression-type random forest is built that wrongly treats the response variable as continuous and returns different results.
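To illustrate the factor versus numeric point, here is a minimal sketch on simulated data, not on the pie dataset. It assumes the randomForest package is installed; the column names `elev`, `slope` and `pres` are made up for the example.

```r
# Hedged sketch: simulated data, hypothetical column names.
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)
  n <- 200
  dat <- data.frame(elev = rnorm(n), slope = rnorm(n))
  dat$pres <- as.integer(dat$elev + rnorm(n) > 0)  # 0/1 presence, numeric

  # Numeric response: a regression forest is fitted.
  # (randomForest warns about a response with few unique values.)
  rf_num <- suppressWarnings(
    randomForest::randomForest(pres ~ elev + slope, data = dat, ntree = 50)
  )

  # Factor response: a classification forest is fitted.
  dat_f <- transform(dat, pres = factor(pres))
  rf_fac <- randomForest::randomForest(pres ~ elev + slope, data = dat_f, ntree = 50)

  print(rf_num$type)       # "regression"
  print(rf_fac$type)       # "classification"
  print(rf_fac$confusion)  # 2 x 2 counts plus a class.error column
}
```

Only the type of the response column differs between the two fits.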

Now to the issue: if you specify a land type presence response variable as a factor and build a random forest model, the PredictionList function fails with the error "ROCR currently supports only evaluation of binary classification tasks". PredictionList first runs the predict function with signature "PredictiveModelList", which for the random forest model returns "0" or "1" labels of character type. With the glm or rpart models, the same function returns numeric probabilities. PredictionList then feeds these character "0"--"1" labels to the ROCR::prediction function, which compares them with the 0--1 numeric values of the test data. I suppose the error arises because the two vectors are of different types, so the 0--1 labels cannot be matched and are treated as four distinct labels (two of character type and two of numeric type) rather than two.

I can think of two immediate workarounds: (1) edit the predict function with signature "PredictiveModelList" so that, for the random forest models, it returns numeric probabilities as it does for the glm and rpart models, instead of 0--1 character labels; or (2) convert the 0--1 labels from character to numeric inside the PredictionList function.
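For workaround (2), the conversion itself could look roughly like the sketch below. It is base R only and uses made-up vectors; it is not the actual code inside PredictionList. One thing to watch: if the labels arrive as a factor rather than as character, calling as.numeric() directly returns the internal level codes instead of 0/1.

```r
# Hypothetical predicted labels, as character "0"/"1"
pred_chr <- c("0", "1", "1", "0")

# Character to numeric is direct
pred_num <- as.numeric(pred_chr)

# If the labels are a factor, go through character first;
# as.numeric() on a factor gives the level codes, not 0/1
pred_fac   <- factor(pred_chr, levels = c("0", "1"))
codes      <- as.numeric(pred_fac)                # 1 2 2 1
pred_num_f <- as.numeric(as.character(pred_fac))  # 0 1 1 0

identical(pred_num, pred_num_f)  # TRUE
```

Even after conversion these are still hard 0--1 labels rather than probabilities.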

I suggest workaround (1), not only so that the random forest models return probabilities like the glm and rpart models, but also because this predict function with signature "PredictiveModelList" can be used elsewhere. As it stands, for example, you can't use it to make suitability maps with random forest models of classification type, since it returns only 0--1 values and the maps need probabilities.

The only edit needed in the predict function with signature "PredictiveModelList" is to replace this block:

if (inherits(mod, "randomForest")) {
    out[[i]] <- predict(object = mod, newdata = newdata, 
                   type = "response", ...)
}

with this block:

if (inherits(mod, "randomForest")) {
    out[[i]] <- predict(object = mod, newdata = newdata, 
                   type = "prob", ...)[, 2] 
}
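To show what the change does outside the package, here is a small sketch on simulated data with hypothetical column names. It assumes randomForest is installed, and the last step runs only if ROCR is also installed. Note that `[, 2]` relies on the presence class being the second factor level; selecting the column by name, as below, is a slightly more defensive variant of the same idea.

```r
# Hedged sketch: simulated data, not the package's own code path.
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)
  n <- 200
  dat <- data.frame(elev = rnorm(n), slope = rnorm(n))
  dat$pres <- factor(as.integer(dat$elev + rnorm(n) > 0))
  train <- dat[1:150, ]
  test  <- dat[151:200, ]

  mod <- randomForest::randomForest(pres ~ elev + slope, data = train, ntree = 50)

  # type = "response": a factor of predicted classes ("0"/"1")
  lab <- predict(mod, newdata = test, type = "response")

  # type = "prob": a matrix with one column of probabilities per class
  prob <- predict(mod, newdata = test, type = "prob")
  p1 <- unname(prob[, "1"])  # same column as prob[, 2] for levels "0", "1"

  if (requireNamespace("ROCR", quietly = TRUE)) {
    # Numeric probabilities against numeric 0/1 test labels
    obs <- as.numeric(as.character(test$pres))
    pr  <- ROCR::prediction(p1, obs)
    auc <- ROCR::performance(pr, measure = "auc")@y.values[[1]]
    print(auc)
  }
}
```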

I hope this helps!

May 11 '20 18:05 xameno