Composite option, dataString compression, and correct splits percentage
Hi @topepo! I've been working on a port of your code into Python (I believe Kirk mentioned that) over here: https://github.com/pjaselin/Cubist. Thank you so much for all the work you are your colleagues have done on this!
Some improvements/fixes I've made here:
- Composite option: in one parameter you can choose whether to use instance-based correction or let Cubist decide. I think this is helpful because instance-based correction adds a little opacity to the prediction process.
- Moved number of nearest neighbors to cubistControl: I understand your intent was to change the number of neighbors at predict time since that's where it comes into use but I wonder if it makes more sense to be part of training as it is used in model evaluation.
- dataString compression: I have the dataString compressed so the model has a smaller memory footprint when using instance-based correction.
- Correct splits percentage: I noticed that you had hardcoded "<=" at https://github.com/topepo/Cubist/blob/548ccd7b75b97e7f263c750c2bdeaa5e4a4c2e5e/R/cubist.R#L212 so I came up with a way to get the right comparison operator based on the model. (This is probably the one definite fix here)
Let me know if you'd like to break this apart and I'd be happy to take feedback!
Also, giving the composite option should help accelerate prediction times for cases like #28
Thanks for doing this; it's great to have an outside contribute.
I've been looking at it for a few days and I'd like to tweak the ui. It's a little awkward to have an argument that could be logical or character. How about we
- control the composite/model-only decision using the
neighborsargument (values > 0 are composite)? - control the auto/manual decision with
auto = TRUE/FALSE?
I think that would get us to the same place. Also, the current specification breaks a lot of existing analyses (in books, vignettes, and so on). I'd like it to be backward compatible and defaulting to auto = FALSE would do that.
Hi @topepo, I really appreciate your feedback and ideas here. Also thank you for allowing me to contribute! This is a much cleaner UI and I think I'll modify my Python implementation to match it. Changes made:
- auto=FALSE: decides whether to allow Cubist to decide whether to use composite models
- neighbors=NA: decides whether to use composite models
- cv=NA: run k-fold cross-validation (this is the last of my improvements in the Python code if you'd like to use it here, note that no model is returned by Cubist)
Also definitely please make sure this all passes your tests beyond mine!