multibayes Support Multiple Weighting Schemes

Currently, each term is represented as an indicator in the sparse matrix, which means each term in each document is weighted uniformly.

Add support for more schemes -- frequency, relative frequency, tf-idf, etc.
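As a rough illustration of how these schemes differ for a single document, here is a minimal sketch. The function names (`termCounts`, `indicatorWeights`, `frequencyWeights`, `relativeFrequencyWeights`) are hypothetical and are not part of the multibayes API; the document is assumed to be already tokenized into a slice of strings.

```go
package main

import "fmt"

// termCounts tallies how often each token occurs in one document.
// Hypothetical helper for illustration; not part of multibayes.
func termCounts(tokens []string) map[string]int {
	counts := make(map[string]int)
	for _, t := range tokens {
		counts[t]++
	}
	return counts
}

// indicatorWeights assigns 1 to every term present in the document,
// which corresponds to the uniform weighting described above.
func indicatorWeights(counts map[string]int) map[string]float64 {
	w := make(map[string]float64, len(counts))
	for t := range counts {
		w[t] = 1
	}
	return w
}

// frequencyWeights uses the raw occurrence count as the weight.
func frequencyWeights(counts map[string]int) map[string]float64 {
	w := make(map[string]float64, len(counts))
	for t, c := range counts {
		w[t] = float64(c)
	}
	return w
}

// relativeFrequencyWeights divides each count by the document length.
func relativeFrequencyWeights(counts map[string]int) map[string]float64 {
	total := 0
	for _, c := range counts {
		total += c
	}
	w := make(map[string]float64, len(counts))
	if total == 0 {
		return w // empty document: no terms to weight
	}
	for t, c := range counts {
		w[t] = float64(c) / float64(total)
	}
	return w
}

func main() {
	counts := termCounts([]string{"spam", "spam", "ham", "eggs"})
	fmt.Println(indicatorWeights(counts))
	fmt.Println(frequencyWeights(counts))
	fmt.Println(relativeFrequencyWeights(counts))
}
```

Each function returns a term-to-weight map, so a scheme could in principle be swapped in wherever the indicator value is currently written into the matrix. How that would fit the existing sparse representation is an open question.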

Dec 23 '14 23:12 drewlanenga

how would you model that? add a weight to each term that could be manipulated?

Jul 21 '17 14:07 kristofer

@kristofer Currently, each term in each document contributes the same amount of information as any term in any other document.

A different weighting scheme, such as tf-idf, could penalize terms that occur frequently in every document, like stop words (even though many English stop words are explicitly filtered from the classifier).

That would require creating additional matrix representations of the training corpus, since the current sparse representation can't support different weighting schemes. For me, this issue has always been a "nice to have", but PRs are always welcome!

Jul 24 '17 13:07 drewlanenga