Support Multiple Weighting Schemes
Currently, each term is represented as an indicator in the sparse matrix, which means every term in every document is weighted uniformly.
Add support for more schemes: frequency, relative frequency, tf-idf, etc.
How would you model that? By adding a weight to each term that could be manipulated?
@kristofer Currently, each term in each document contributes the same amount of information as any term in any other document.
A different weighting scheme, such as tf-idf, could penalize terms that occur frequently in every document, like stop words, although many English stop words are already explicitly filtered from the classifier.
That would require creating additional matrix representations of the training corpus, since the current sparse indicator representation only records whether a term is present and can't support different weighting schemes. For me, this issue has always been a "nice to have", but PRs are always welcome!
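
To make the discussion concrete, here is a rough sketch of how the schemes mentioned above differ. It is not this project's code: the function name `weight_documents`, the scheme names, and the dict-per-document layout are all hypothetical, and the tf-idf variant shown (relative frequency times `log(N / df)`) is just one common formulation.

```python
import math
from collections import Counter


def weight_documents(docs, scheme="indicator"):
    """Return one {term: weight} dict per document.

    `docs` is a list of token lists. Hypothetical helper for illustration;
    a real implementation would likely build a sparse matrix instead.
    """
    counts = [Counter(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for c in counts for term in c)

    weighted = []
    for c in counts:
        total = sum(c.values())
        row = {}
        for term, n in c.items():
            if scheme == "indicator":
                row[term] = 1.0  # present or not, as in the current matrix
            elif scheme == "frequency":
                row[term] = float(n)  # raw count within the document
            elif scheme == "relative_frequency":
                row[term] = n / total  # count normalized by document length
            elif scheme == "tf_idf":
                # Terms found in every document get log(1) = 0.
                row[term] = (n / total) * math.log(n_docs / doc_freq[term])
            else:
                raise ValueError(f"unknown scheme: {scheme}")
        weighted.append(row)
    return weighted


docs = [["the", "cat", "sat"], ["the", "dog", "the"]]
for scheme in ("indicator", "frequency", "relative_frequency", "tf_idf"):
    print(scheme, weight_documents(docs, scheme))
```

Under this formulation, a term like "the" that appears in every document ends up with a tf-idf weight of zero, which is the penalizing effect described above. Each scheme produces different values for the same corpus, which is why a separate weighted representation would be needed alongside the indicator matrix.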