multibayes

Support Multiple Weighting Schemes

Open drewlanenga opened this issue 11 years ago • 2 comments

Currently, each term is represented as a binary indicator in the sparse matrix, which means every term in every document is weighted uniformly.

Add support for more schemes -- frequency, relative frequency, tf-idf, etc.

drewlanenga avatar Dec 23 '14 23:12 drewlanenga

How would you model that? Add a weight to each term that could be manipulated?

kristofer avatar Jul 21 '17 14:07 kristofer

@kristofer Currently, each term in each document contributes the same amount of information as any term in any other document.

A different weighting scheme, such as tf-idf, could penalize terms that occur frequently in every document, like stop words (although many English stop words are already explicitly filtered from the classifier).
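To make the penalty concrete, one common formulation of tf-idf (an assumption here, since the issue does not commit to a specific variant) weights term $t$ in document $d$ as:

$$
\operatorname{tfidf}(t, d) = \operatorname{tf}(t, d) \cdot \log\frac{N}{\operatorname{df}(t)}
$$

where $\operatorname{tf}(t, d)$ is the frequency of $t$ in $d$, $N$ is the number of documents in the training corpus, and $\operatorname{df}(t)$ is the number of documents containing $t$. A term that appears in every document has $\operatorname{df}(t) = N$, so its weight is $\operatorname{tf}(t, d) \cdot \log 1 = 0$, no matter how often it occurs.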

That would require creating additional matrix representations of the training corpus, since the current sparse representation can't support different weighting schemes. For me, this issue has always been a "nice to have", but PRs are always welcome!

drewlanenga avatar Jul 24 '17 13:07 drewlanenga