SPORF
gini versus `what is this?`
These two statistics for node impurity seem very similar (up to a shift?): I'm not sure if there is a benefit to using one over the other, @jovo?
https://github.com/neurodata/RerF/blob/c4d602cd4d763dc728bb48e2cf84114638d9f074/packedForest/src/forestTypes/binnedTree/inNodeClassTotals.h#L60-L77
@falkben thoughts?
gini is a fraction in [0,1] and the other thing is a percentage in [0,100]?
dunno.
@jbrowne might know?
On Wed, Apr 3, 2019 at 4:46 PM Jesse Leigh Patsolic wrote:
> gini is a fraction in [0,1] and the other thing is a percentage in [0,100]?
@MrAE, Jovo and Tyler spent an hour convincing me that if you divide the result of the bottom procedure by the number of objects, you get exactly the same result as the top procedure. I made a Google Sheet to demonstrate:
https://docs.google.com/spreadsheets/d/1v93r0I-FkHt-kpHIn33Owv7fhW_z8i4neGrjWsyaW2U/edit?usp=sharing
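For anyone who would rather see the algebra than the spreadsheet, here is a short sketch. It assumes the "top procedure" is the standard Gini impurity and the "bottom procedure" is the count-weighted sum over classes; the linked code is not reproduced here, so treat both forms as assumptions. The symbols $n_k$ (number of objects of class $k$ in the node), $N$ (total objects in the node), $G$ and $S$ are my own notation.

```latex
\begin{align*}
G &= 1 - \sum_{k} \left(\frac{n_k}{N}\right)^2, \qquad N = \sum_k n_k \\
S &= \sum_{k} n_k \left(1 - \frac{n_k}{N}\right)
   = N - \frac{1}{N}\sum_k n_k^2
   = N \left(1 - \sum_k \frac{n_k^2}{N^2}\right)
   = N \, G
\end{align*}
```

Under those assumptions the two statistics differ by a factor of $N$ (a scale, not a shift), so $S / N = G$.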
We could divide by the number of objects here and return the real Gini, but the next step, which computes the final impurity score for the node, multiplies by the number of objects. So we would divide numobs out just to multiply it back in. That is how I rationalized it; I could be wrong, though.
The top procedure looks much faster than the bottom one, though, so using it instead and multiplying by numobs afterwards might be faster.
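A quick way to check the claim outside the spreadsheet is a small Python sketch. This is not the RerF code: `gini`, `weighted_impurity` and `split_score` are hypothetical names, and the two formulas are my reading of the "top" and "bottom" procedures discussed above. It uses exact fractions so the comparison is not blurred by floating-point rounding, and it assumes every node holds at least one object.

```python
from fractions import Fraction


def gini(counts):
    """Assumed 'top' form: 1 - sum(n_k^2) / N^2 (standard Gini impurity)."""
    n = sum(counts)  # N, total objects in the node (must be > 0)
    return 1 - Fraction(sum(c * c for c in counts), n * n)


def weighted_impurity(counts):
    """Assumed 'bottom' form: sum over classes of n_k * (1 - n_k / N)."""
    n = sum(counts)  # N, total objects in the node (must be > 0)
    return sum(c * (1 - Fraction(c, n)) for c in counts)


def split_score(left, right):
    """Hypothetical split score: sum of the count-weighted child impurities."""
    return weighted_impurity(left) + weighted_impurity(right)


counts = [3, 5, 2]  # class totals in one node
# The count-weighted form is the Gini impurity scaled by the node size.
assert weighted_impurity(counts) == sum(counts) * gini(counts)

left, right = [3, 1], [0, 4]
# Scoring a split with the count-weighted form matches multiplying each
# child's Gini by its number of objects.
assert split_score(left, right) == sum(left) * gini(left) + sum(right) * gini(right)
```

If that reading is right, either form can feed the split score; the only difference is where the multiplication by numobs happens.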