rpart icon indicating copy to clipboard operation
rpart copied to clipboard

separating interfaces

Open topepo opened this issue 6 years ago • 3 comments

Would you be open to a PR for refactoring rpart intorpart.formula and rpart.default methods? I think I may have mentioned this back in January at the Rstudio conference.

topepo avatar Aug 14 '18 14:08 topepo

I’ll discuss this with Terry and get back to you.

Beth

From: Max Kuhn [mailto:[email protected]] Sent: Tuesday, August 14, 2018 9:30 AM To: bethatkinson/rpart Cc: Subscribed Subject: [EXTERNAL] [bethatkinson/rpart] separating interfaces (#1)

Would you be open to a PR for refactoring rpart intorpart.formula and rpart.default methods? I think I may have mentioned this back in January at the Rstudio conference.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/bethatkinson/rpart/issues/1, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AK0Ddw9xNO7UJRbW43nFHAdt0Ac-0mUkks5uQt7jgaJpZM4V8jKA.

bethatkinson avatar Aug 14 '18 18:08 bethatkinson

Thanks. Just to be clear, my goals would be to avoid process the formula calculations (especially if a suitable data set consistent with rpart.matrix's output exists). For bagging and some other ensembles, a lot of the final model footprint contains redundant terms and formula-related components.

I can see that the change from function to generic for rpart might be problematic. Exposing rpart.matrix and internal functions below that call would allow us to reproduce rpart internals. That might be a good fallback if changing rpart to be an S3 wouldn't work (but this approach is not preferable for obvious reasons).

topepo avatar Aug 16 '18 14:08 topepo

Beth and Max,

This is exactly what is done by the division between coxph and coxph.fit, survreg and survreg.fit, etc in the survival package. That has been a successful strategy to allow others access to the computations without using formulas. I prefer doing it this way rather than creating a rpart method along with rpart.formula and rpart.default since a. the method doesn't buy us anything more and b. the rpart.fit strategy works better wrt documentation and .Rd files. It is easier for us, and more importantly easier for the user, who won't need the "rpart:::" secret handshake to find the routines.

So yes, I agree we should go ahead. The primary decisions will be exactly where to split it (how little or much should the formula part do), and how many of the error checks to replicate. The coxph.fit routine assumes for instance that the data is good, e.g. X and y match in dimension, as it will normally be called by a front end routine that has taken care of such checks.

Terry Therneau


I think that would be fine.

I would suggest exporting rpart.matrix and having rpart.fit start right after that.

My reasoning is that people should only have to call model.matrix (contained in rpart.matrix) once so that that potential inefficacy (for large $p$) might be avoided.

However, I’m not the expert here and there are some terms and similar objects that are needed.

Let me know if I can help.

Thanks,

Max

bethatkinson avatar Oct 06 '21 21:10 bethatkinson