SPORF gini versus `what is this?`

These two statistics for node impurity seem very similar (up to a shift?): I'm not sure if there is a benefit to using one over the other, @jovo?

https://github.com/neurodata/RerF/blob/c4d602cd4d763dc728bb48e2cf84114638d9f074/packedForest/src/forestTypes/binnedTree/inNodeClassTotals.h#L60-L77

giniTest.Rmd giniTest.pdf

Apr 03 '19 20:04 MrAE

@falkben thoughts?

Apr 03 '19 20:04 MrAE

gini is a fraction in [0,1] and the other thing is a percentage in [0,100]?

Apr 03 '19 20:04 MrAE

dunno.

@jbrowne might know?


Apr 03 '19 21:04 jovo

@MrAE, Jovo and Tyler spent an hour convincing me that if you divide the result of the bottom procedure by the number of objects, you get exactly the same result as the top procedure. I made a Google Sheet to demonstrate.

https://docs.google.com/spreadsheets/d/1v93r0I-FkHt-kpHIn33Owv7fhW_z8i4neGrjWsyaW2U/edit?usp=sharing
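For anyone who doesn't want to open the sheet, here is a minimal sketch of the same check. It assumes the top procedure is the usual Gini, `1 - sum(p_k^2)`, and the bottom one is `sum(n_k * (1 - n_k / N))`; the linked header isn't quoted in this thread, so both forms and the function names below are illustrative, not copied from the repo.

```python
def gini(class_counts):
    """Standard Gini impurity, a value in [0, 1]: 1 - sum_k p_k^2."""
    n = sum(class_counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in class_counts)


def count_weighted_impurity(class_counts):
    """Assumed form of the 'bottom' procedure: sum_k n_k * (1 - n_k / N)."""
    n = sum(class_counts)
    if n == 0:
        return 0.0
    return sum(c * (1.0 - c / n) for c in class_counts)


counts = [30, 50, 20]          # made-up class totals for one node
n = sum(counts)

top = gini(counts)
bottom = count_weighted_impurity(counts)

# Dividing the bottom value by the number of objects recovers the Gini.
print(top, bottom, bottom / n)
```

Under these assumptions the relationship is a scale factor of N rather than a shift, and the bottom value ranges over [0, N) rather than [0, 100].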

We could divide by the number of objects here and return the real Gini, but in the next step, when you compute the final impurity score for the node, you multiply by the number of objects again. So we would divide numobs out just to multiply it back in. That is how I rationalized it, though I could be wrong.
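A rough sketch of that divide-then-multiply argument, assuming the node score for a candidate split is the sum over children of (number of objects in the child) times (child Gini). The scoring rule and all names here are assumptions for illustration, not the packedForest code.

```python
def gini(class_counts):
    """Standard Gini impurity: 1 - sum_k p_k^2."""
    n = sum(class_counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in class_counts)


def count_weighted_impurity(class_counts):
    """Unnormalized form: sum_k n_k * (1 - n_k / N), i.e. N * gini."""
    n = sum(class_counts)
    if n == 0:
        return 0.0
    return sum(c * (1.0 - c / n) for c in class_counts)


def split_score_via_gini(left_counts, right_counts):
    """Compute each child's Gini, then multiply numobs back in."""
    return (sum(left_counts) * gini(left_counts)
            + sum(right_counts) * gini(right_counts))


def split_score_direct(left_counts, right_counts):
    """Skip the normalization and add the unnormalized values directly."""
    return (count_weighted_impurity(left_counts)
            + count_weighted_impurity(right_counts))


left, right = [10, 0], [5, 15]  # made-up class totals for two children
print(split_score_via_gini(left, right), split_score_direct(left, right))
```

Both functions give the same score, so ranking candidate splits is unaffected by which form is used.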

The top procedure looks much faster than the bottom one, though, so it might be quicker to use it and multiply by numobs afterwards.
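One way that could look, again assuming the top procedure is the sum-of-squares form: accumulate the squared class counts in integer arithmetic and do a single division at the end, since `N * (1 - sum(n_k^2) / N^2) = N - sum(n_k^2) / N`. The function names are hypothetical, and whether this is actually faster in the C++ code would need benchmarking.

```python
def count_weighted_from_squares(class_counts):
    """N * gini computed as N - sum_k(n_k^2) / N, with one division."""
    n = sum(class_counts)
    if n == 0:
        return 0.0
    sum_sq = sum(c * c for c in class_counts)  # integer-only loop
    return n - sum_sq / n


def count_weighted_per_class(class_counts):
    """Per-class form for comparison: one division per class."""
    n = sum(class_counts)
    if n == 0:
        return 0.0
    return sum(c * (1.0 - c / n) for c in class_counts)


counts = [30, 50, 20]  # made-up class totals
print(count_weighted_from_squares(counts), count_weighted_per_class(counts))
```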

Apr 04 '19 13:04 jbrowne6
