keras-preprocessing
Tokenizer.fit_on_texts should have a mode option to select tokens via tf-idf scores
I want to add a `mode` argument to `Tokenizer.fit_on_texts` so that tokens can be filtered by tf-idf score when a limit is specified.
When given `num_words` as an argument, `Tokenizer` currently uses raw frequency to select the top `num_words` tokens to keep for tokenization. This favors stop words over words that distinguish one document from another, which seems far from ideal for many (most?) applications. TF-IDF would give better results because it would select the tokens most relevant to the rest of the pipeline rather than those that appear in most documents.
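To illustrate the difference, here is a minimal sketch in plain Python, not the keras-preprocessing implementation. The helper `top_tokens` and its `mode` values are hypothetical names used only for this example, and scoring each token by its corpus count times an unsmoothed `log(n_docs / doc_freq)` is just one assumed way to aggregate TF-IDF over a corpus.

```python
import math
from collections import Counter


def top_tokens(texts, num_words, mode="count"):
    """Hypothetical helper: keep the top `num_words` tokens by count or TF-IDF."""
    docs = [text.lower().split() for text in texts]
    # Total occurrences of each token across the whole corpus.
    term_counts = Counter(tok for doc in docs for tok in doc)

    if mode == "count":
        scores = dict(term_counts)
    elif mode == "tfidf":
        # Number of documents each token appears in.
        doc_freq = Counter(tok for doc in docs for tok in set(doc))
        n_docs = len(docs)
        # Assumed scoring: corpus count times unsmoothed inverse document frequency.
        scores = {
            tok: term_counts[tok] * math.log(n_docs / doc_freq[tok])
            for tok in term_counts
        }
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda tok: (-scores[tok], tok))
    return ranked[:num_words]


texts = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock prices fell as the stock market dropped",
]

by_count = top_tokens(texts, num_words=1, mode="count")
by_tfidf = top_tokens(texts, num_words=1, mode="tfidf")
print(by_count)  # "the" is the most frequent token
print(by_tfidf)  # "the" scores 0 here because it occurs in every document
```

With this toy corpus, frequency-based selection keeps "the", while the TF-IDF variant keeps "stock", which only appears in one document.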
- [x] Check that you are up-to-date with the master branch of keras-preprocessing. You can update with:
  `pip install git+git://github.com/keras-team/keras-preprocessing.git --upgrade --no-deps`
- [x] Provide a link to a GitHub Gist of a Python script that can reproduce your issue (or just copy the script here if it is short).