
Is smoothing really needed for prob calc in bayes_classifier?

Open jiabinf opened this issue 8 years ago • 4 comments

Thanks for creating NaturalNode!

I am using your Bayes Classifier in my project. While looking into the implementation, I found that it adds smoothing when calculating the probabilities.

Applying this smoothing to unknown words in the test set skews the probability towards whichever class has the fewest features. For instance:

Say smoothing === 1, class A has 2 features, and class B has 3. Since (0 + 1) / 2 is bigger than (0 + 1) / 3, A always wins.

I understand it may be good to have smoothing for the training set, but is it really necessary for the test set? Why not just discard the tokens which are not in classFeatures[label]?

    while(i--) {
        if(observation[i]) {
            var count = this.classFeatures[label][i] || this.smoothing;
            // numbers are tiny, add logs rather than take product
            prob += Math.log(count / this.classTotals[label]);
        }
    }
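To make the arithmetic above concrete, here is a minimal sketch that reproduces the A-versus-B example. The function `logScore` and its `skipUnseen` flag are hypothetical names written for this illustration and are not part of the library; the loop only mirrors the snippet quoted above.

```javascript
// Hypothetical stand-in for the quoted loop; not the library's actual API.
// skipUnseen = true models the "discard unknown tokens" idea from this issue.
function logScore(observation, featureCounts, classTotal, smoothing, skipUnseen) {
  let logProb = 0;
  for (let i = 0; i < observation.length; i++) {
    if (!observation[i]) continue;
    const seen = featureCounts[i] || 0;
    // Optionally ignore tokens this class never saw in training.
    if (seen === 0 && skipUnseen) continue;
    const count = seen || smoothing;
    // Numbers are tiny, so add logs rather than take the product.
    logProb += Math.log(count / classTotal);
  }
  return logProb;
}

// One token that neither class saw in training.
const observation = [1];
const scoreA = logScore(observation, [0], 2, 1, false); // log(1 / 2)
const scoreB = logScore(observation, [0], 3, 1, false); // log(1 / 3)
const skippedA = logScore(observation, [0], 2, 1, true);
const skippedB = logScore(observation, [0], 3, 1, true);
console.log({ scoreA, scoreB, skippedA, skippedB });
```

With smoothing applied, A scores higher than B purely because its total is smaller; with the token discarded, both classes score the same for it. Note that discarding per class has its own side effect: whenever a count is below the class total, the log term is negative, so a token seen by A but not by B would lower A's score while leaving B's untouched.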

jiabinf avatar Oct 18 '16 08:10 jiabinf

Rephrased my question, and also found some discussion here: http://stats.stackexchange.com/a/108990

jiabinf avatar Oct 18 '16 18:10 jiabinf

Hi, it is correct that if you are evaluating a single unknown feature the system will always pick the same class for it. In general it is the majority class, but as you point out, smoothing might change that.

In general, Laplacian smoothing is a very poor smoothing technique, but smoothing in general hinges on how much probability mass to allocate to unseen events. Not using it at test time seems to miss the point, but with a very poor smoothing algorithm you might come out ahead, yes. I hope to add Good Turing smoothing at some point (and don't cry too much about it: www.csie.ntu.edu.tw/~b92b02053/print/good-turing-smoothing-without.pdf )

By the way, you can disable smoothing by setting epsilon to zero when using the classifier (in test mode).
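As a rough illustration of what a zero value does, assuming it is the same `smoothing` used in the loop quoted earlier in this thread, the sketch below evaluates the log term for a feature the class never saw. The helper `unseenLogTerm` and the class total of 3 are hypothetical, chosen only for this example.

```javascript
// Hypothetical helper; mirrors only the arithmetic of the quoted loop
// for a feature whose training count in this class is 0.
function unseenLogTerm(smoothing, classTotal) {
  const count = 0 || smoothing; // falls back to the smoothing value
  return Math.log(count / classTotal);
}

for (const smoothing of [1, 0.01, 0]) {
  console.log(smoothing, unseenLogTerm(smoothing, 3));
}
```

Under that assumption, a zero value makes the term `Math.log(0)`, which is `-Infinity` in JavaScript, so a class that lacks any observed feature can no longer be compared on its remaining features. A small positive value keeps the term finite while still penalizing the unseen feature heavily.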

If this answers your question, consider closing this bug as it does not affect the implementation of the algorithms in the code.

DrDub avatar Nov 03 '16 17:11 DrDub

@DrDub thanks for your reply. I set smoothing to 0.01 and kept the training sets for the different classes balanced; overall it works well.

Still, looking forward to the new Good Turing smoothing. 👍

jiabinf avatar Dec 27 '16 07:12 jiabinf

@DrDub +1 for the PDF. It looks very interesting, I'll read it as soon as I have some free time.

ghost avatar Mar 22 '18 05:03 ghost