rpart
rpart copied to clipboard
separating interfaces
Would you be open to a PR for refactoring rpart
intorpart.formula
and rpart.default
methods? I think I may have mentioned this back in January at the Rstudio conference.
I’ll discuss this with Terry and get back to you.
Beth
From: Max Kuhn [mailto:[email protected]] Sent: Tuesday, August 14, 2018 9:30 AM To: bethatkinson/rpart Cc: Subscribed Subject: [EXTERNAL] [bethatkinson/rpart] separating interfaces (#1)
Would you be open to a PR for refactoring rpart intorpart.formula and rpart.default methods? I think I may have mentioned this back in January at the Rstudio conference.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/bethatkinson/rpart/issues/1, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AK0Ddw9xNO7UJRbW43nFHAdt0Ac-0mUkks5uQt7jgaJpZM4V8jKA.
Thanks. Just to be clear, my goals would be to avoid process the formula calculations (especially if a suitable data set consistent with rpart.matrix
's output exists). For bagging and some other ensembles, a lot of the final model footprint contains redundant terms and formula-related components.
I can see that the change from function to generic for rpart
might be problematic. Exposing rpart.matrix
and internal functions below that call would allow us to reproduce rpart
internals. That might be a good fallback if changing rpart
to be an S3 wouldn't work (but this approach is not preferable for obvious reasons).
Beth and Max,
This is exactly what is done by the division between coxph and coxph.fit, survreg and survreg.fit, etc in the survival package. That has been a successful strategy to allow others access to the computations without using formulas. I prefer doing it this way rather than creating a rpart method along with rpart.formula and rpart.default since a. the method doesn't buy us anything more and b. the rpart.fit strategy works better wrt documentation and .Rd files. It is easier for us, and more importantly easier for the user, who won't need the "rpart:::" secret handshake to find the routines.
So yes, I agree we should go ahead. The primary decisions will be exactly where to split it (how little or much should the formula part do), and how many of the error checks to replicate. The coxph.fit routine assumes for instance that the data is good, e.g. X and y match in dimension, as it will normally be called by a front end routine that has taken care of such checks.
Terry Therneau
I think that would be fine.
I would suggest exporting rpart.matrix and having rpart.fit start right after that.
My reasoning is that people should only have to call model.matrix (contained in rpart.matrix) once so that that potential inefficacy (for large $p$) might be avoided.
However, I’m not the expert here and there are some terms and similar objects that are needed.
Let me know if I can help.
Thanks,
Max