rpart
rpart copied to clipboard
rpart has undefined behavior when a predictor has many categories.
Hi Beth!
Here's a little example:
library(mlbench)
library(rpart)
data("BostonHousing2", package = "mlbench")
model <- rpart::rpart(medv ~ town, data = BostonHousing2)
Here, town has 92 possible values.
If you run this code under an address sanitizer, you get the following error:
UndefinedBehaviorSanitizer: out-of-bounds-index rpart/src/bsplit.c:79:4
And here is a rough traceback:
rpart/src/bsplit.c:79:4: runtime error: index 20 out of bounds for type 'int [20]'
#0 rpart/src/bsplit.c:79:22
#1 rpart/src/partition.c:73:5
#2 rpart/src/rpart.c:218:5
#3 R/main/dotcode.c
I believe this is due to the design of the Split/ pSplit struct, which sets the arrays to have 20 values: https://github.com/bethatkinson/rpart/blob/39806853bf5b1931e795897d7a8d5cb20b366ca6/src/node.h#L10-L18
To reproduce the error, you'll need to compile R with sanitizers enabled. This is a little extreme, I would recommend trying a check with R-hub. https://cran.r-project.org/web/packages/rhub/vignettes/rhub.html
Or with one of the docker images here: https://github.com/rocker-org/rocker#unversioned-images-builds-on-r-base
Best wishes, Michael
Thank you for your detailed analysis of the problem. When rpart was developed, I don't think levels of 92 were anticipated. I'll add this to the list.
Beth
From: Michael Quinn [email protected] Sent: Friday, August 28, 2020 4:18 PM To: bethatkinson/rpart [email protected] Cc: Subscribed [email protected] Subject: [EXTERNAL] [bethatkinson/rpart] rpart has undefined behavior when a predictor has many categories. (#22)
Hi Beth!
Here's a little example:
library(mlbench) library(rpart)
data("BostonHousing2", package = "mlbench") model <- rpart::rpart(medv ~ town, data = BostonHousing2)
Here, town has 92 possible values.
If you run this code under an address sanitizer, you get the following error:
UndefinedBehaviorSanitizer: out-of-bounds-index rpart/src/bsplit.c:79:4
And here is a rough traceback:
rpart/src/bsplit.c:79:4: runtime error: index 20 out of bounds for type 'int [20]' #0 rpart/src/bsplit.c:79:22 #1 rpart/src/partition.c:73:5 #2 rpart/src/rpart.c:218:5 #3 R/main/dotcode.c
I believe this is due to the design of the Split/ pSplit struct, which sets the arrays to have 20 values: https://github.com/bethatkinson/rpart/blob/39806853bf5b1931e795897d7a8d5cb20b366ca6/src/node.h#L10-L18
To reproduce the error, you'll need to compile R with sanitizers enabled. This is a little extreme, I would recommend trying a check with R-hub. https://cran.r-project.org/web/packages/rhub/vignettes/rhub.html
Or with one of the docker images here: https://github.com/rocker-org/rocker#unversioned-images-builds-on-r-base
Best wishes, Michael
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/bethatkinson/rpart/issues/22, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ACWQG5Z6SDMHS6HHA3JPFODSDANI7ANCNFSM4QOQ3LZA.
I was not able to reproduce this error when using docker run -ti rocker/r-devel-san
(R 4.0.0). No error was shown. (I haven't used the sanitizer before, so let me know if there's something else I should look for.)
The size allocated for the structs is actually bigger than their size definition (where the fields only have size 20) and so the final field "grows". For the offending line https://github.com/bethatkinson/rpart/blob/39806853bf5b1931e795897d7a8d5cb20b366ca6/src/bsplit.c#L79 the struct on the left's size is (ncat
is the number of levels for this variable), https://github.com/bethatkinson/rpart/blob/39806853bf5b1931e795897d7a8d5cb20b366ca6/src/insert_split.c#L20
and the field on the right is allocated here (maxcat
is maximum number of levels across all variables), https://github.com/bethatkinson/rpart/blob/39806853bf5b1931e795897d7a8d5cb20b366ca6/src/rpart.c#L182
While a bit messy, it's not obvious the code is wrong.
Thanks Brian! I can confirm that this is still an error on R 3.6.3
Here's a bit of a traceback:
-- Test started at 2021-08-16 13:08:01 PDT --
-----------------------------------------------------------------------------
rpart/src/bsplit.c:79:4: runtime error: index 20 out of bounds for type 'int [20]'
#0 0x7fcb13397905 in bsplit rpart/src/bsplit.c:79:22
#1 0x7fcb1339f96d in partition rpart/src/partition.c:73:5
#2 0x7fcb133a619a in rpart rpart/src/rpart.c:218:5
#3 0x555d3a2137bb in R_doDotCall R/src/main/dotcode.c
#4 0x555d3a22103e in do_dotcall R/src/main/dotcode.c:1252:11
#5 0x555d3a2b4171 in bcEval R/src/main/eval.c:6765:14
#6 0x555d3a2a5e13 in Rf_eval R/src/main/eval.c:620:8
#7 0x555d3a2e34f6 in R_execClosure R/src/main/eval.c
#8 0x555d3a2dfafa in Rf_applyClosure R/src/main/eval.c:1706:16
#9 0x555d3a2a6716 in Rf_eval R/src/main/eval.c:743:12
#10 0x555d3a2ee049 in do_set R/src/main/eval.c:2807:8
#11 0x555d3a2a6305 in Rf_eval R/src/main/eval.c:695:12
#12 0x555d3a37293c in Rf_ReplIteration R/src/main/main.c:260:2
#13 0x555d3a3769e0 in R_ReplConsole R/src/main/main.c:310:11
#14 0x555d3a3767b8 in run_Rmainloop R/src/main/main.c:1103:5
#15 0x555d3a049bbd in main R/src/main/Rmain.c:30:5
SUMMARY: UndefinedBehaviorSanitizer: out-of-bounds-index rpart/src/bsplit.c:79:4
My test script in R is:
library(mlbench)
library(rpart)
data("BostonHousing2", package = "mlbench")
model <- rpart::rpart(medv ~ town, data = BostonHousing2)
Thanks again!