
Tokenizer.fit_on_texts should have a mode option to select tokens via tf-idf scores

Open rjurney opened this issue 5 years ago • 0 comments

I want to add a mode argument to Tokenizer.fit_on_texts so that, when num_words sets a limit, tokens can be selected by tf-idf score instead of raw frequency.

When given num_words as an argument, Tokenizer currently keeps the top num_words tokens by frequency. This favors stop words over words that distinguish documents from one another, which seems far from ideal for many (most?) applications. TF-IDF would give better results because it would select the tokens most relevant to the rest of the pipeline rather than those that appear in most documents.
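To make the difference concrete, here is a minimal standalone sketch of the two selection strategies. It does not use or modify Tokenizer; the helper `select_vocab` and its `mode` values ("count", "tfidf") are hypothetical names chosen for illustration, and the tf-idf variant assumed here is total term count times log(N / document frequency), which is only one of several possible weightings.

```python
import math
from collections import Counter


def select_vocab(texts, num_words, mode="count"):
    """Return the top `num_words` tokens ranked by the chosen score.

    Hypothetical helper for illustration only; not part of keras-preprocessing.
    """
    # Naive whitespace tokenization, lowercased (an assumption for brevity).
    docs = [text.lower().split() for text in texts]

    # Total occurrences of each token across the corpus.
    term_counts = Counter(tok for doc in docs for tok in doc)

    if mode == "count":
        scores = dict(term_counts)
    elif mode == "tfidf":
        # Number of documents each token appears in.
        doc_freq = Counter(tok for doc in docs for tok in set(doc))
        n_docs = len(docs)
        # Tokens present in every document get idf = log(1) = 0.
        scores = {
            tok: term_counts[tok] * math.log(n_docs / doc_freq[tok])
            for tok in term_counts
        }
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda tok: (-scores[tok], tok))
    return ranked[:num_words]


texts = [
    "the cat sat on the mat",
    "the dog chased the ball",
    "the bird sang in the tree",
]

by_count = select_vocab(texts, 3, mode="count")
by_tfidf = select_vocab(texts, 3, mode="tfidf")
```

In this toy corpus, frequency ranking puts "the" first, while the tf-idf ranking drops it because it occurs in every document.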

  • [x] Check that you are up-to-date with the master branch of keras-preprocessing. You can update with: pip install git+git://github.com/keras-team/keras-preprocessing.git --upgrade --no-deps

  • [x] Provide a link to a GitHub Gist of a Python script that can reproduce your issue (or just copy the script here if it is short).

rjurney, Oct 08 '19 22:10