Cubist icon indicating copy to clipboard operation
Cubist copied to clipboard

Composite option, dataString compression, and correct splits percentage

Open pjaselin opened this issue 4 years ago • 3 comments

Hi @topepo! I've been working on a port of your code into Python (I believe Kirk mentioned that) over here: https://github.com/pjaselin/Cubist. Thank you so much for all the work you are your colleagues have done on this!

Some improvements/fixes I've made here:

  • Composite option: in one parameter you can choose whether to use instance-based correction or let Cubist decide. I think this is helpful because instance-based correction adds a little opacity to the prediction process.
  • Moved number of nearest neighbors to cubistControl: I understand your intent was to change the number of neighbors at predict time since that's where it comes into use but I wonder if it makes more sense to be part of training as it is used in model evaluation.
  • dataString compression: I have the dataString compressed so the model has a smaller memory footprint when using instance-based correction.
  • Correct splits percentage: I noticed that you had hardcoded "<=" at https://github.com/topepo/Cubist/blob/548ccd7b75b97e7f263c750c2bdeaa5e4a4c2e5e/R/cubist.R#L212 so I came up with a way to get the right comparison operator based on the model. (This is probably the one definite fix here)

Let me know if you'd like to break this apart and I'd be happy to take feedback!

pjaselin avatar Dec 27 '21 18:12 pjaselin

Also, giving the composite option should help accelerate prediction times for cases like #28

pjaselin avatar Dec 27 '21 22:12 pjaselin

Thanks for doing this; it's great to have an outside contribute.

I've been looking at it for a few days and I'd like to tweak the ui. It's a little awkward to have an argument that could be logical or character. How about we

  • control the composite/model-only decision using the neighbors argument (values > 0 are composite)?
  • control the auto/manual decision with auto = TRUE/FALSE?

I think that would get us to the same place. Also, the current specification breaks a lot of existing analyses (in books, vignettes, and so on). I'd like it to be backward compatible and defaulting to auto = FALSE would do that.

topepo avatar Feb 04 '22 19:02 topepo

Hi @topepo, I really appreciate your feedback and ideas here. Also thank you for allowing me to contribute! This is a much cleaner UI and I think I'll modify my Python implementation to match it. Changes made:

  • auto=FALSE: decides whether to allow Cubist to decide whether to use composite models
  • neighbors=NA: decides whether to use composite models
  • cv=NA: run k-fold cross-validation (this is the last of my improvements in the Python code if you'd like to use it here, note that no model is returned by Cubist)

Also definitely please make sure this all passes your tests beyond mine!

pjaselin avatar Feb 08 '22 04:02 pjaselin