Tokenizer.fit_on_texts splits a single string into chars even when char_level=False
From: https://github.com/keras-team/keras/issues/10768 by @hadaev8
Tokenizer fits/transforms a string character by character if a bare string is passed to the fit_on_texts/texts_to_sequences methods, regardless of the char_level setting. This happens because both methods expect a list of strings, so a single string gets iterated over one character at a time. The relevant line for fitting:
https://github.com/keras-team/keras-preprocessing/blob/e002ebd40e888965686e8946acefe02f5a910576/keras_preprocessing/text.py#L205
and this one for transforming: https://github.com/keras-team/keras-preprocessing/blob/e002ebd40e888965686e8946acefe02f5a910576/keras_preprocessing/text.py#L293
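To illustrate the mechanism without depending on Keras, here is a minimal sketch. `count_words` is a hypothetical, simplified stand-in for the fitting loop (it is not the library's code): it treats every item of its argument as one document and splits it on whitespace. Passing a bare string makes each character its own "document".

```python
from collections import Counter


def count_words(texts):
    """Hypothetical stand-in for the fitting loop, not the Keras implementation.

    Each item yielded by iterating over `texts` is treated as one document
    and split into lowercase, whitespace-separated words.
    """
    counts = Counter()
    for text in texts:  # iterating over a bare str yields single characters
        counts.update(text.lower().split())
    return counts


# Bare string: every character becomes a one-letter "word"; spaces split to nothing.
print(count_words("check check fail"))

# List with one string: the string is treated as a single document.
print(count_words(["check check fail"]))
```

The first call counts letters and the second counts whole words, which mirrors the difference between the two word_index results in this report.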
Reproducible code illustrating the problem with fit_on_texts:
from keras.preprocessing.text import Tokenizer
text = 'check check fail'
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text)
tokenizer.word_index
Output:
{'c': 1, 'h': 2, 'e': 3, 'k': 4, 'f': 5, 'a': 6, 'i': 7, 'l': 8}
Wrapping the text in a list solves the issue:
tokenizer.fit_on_texts([text])
tokenizer.word_index
{'check': 1, 'fail': 2}
I'd recommend checking that the input is a list of strings and, if it is not, either emitting a warning and wrapping it in a list, or raising an error.
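A rough sketch of the warn-and-wrap variant, usable on the caller's side in the meantime. `ensure_text_list` is a hypothetical helper name, not part of Keras, and the warning text is only a suggestion:

```python
import warnings


def ensure_text_list(texts):
    """Hypothetical helper (not a Keras API): wrap a bare string in a list.

    Emits a UserWarning when wrapping, so the caller learns that a single
    string was passed where a list of strings was expected.
    """
    if isinstance(texts, str):
        warnings.warn(
            "Expected a list of strings but got a single string; "
            "wrapping it in a list.",
            UserWarning,
            stacklevel=2,
        )
        return [texts]
    return texts
```

It would be used as tokenizer.fit_on_texts(ensure_text_list(text)). Replacing the warning with a raised TypeError would give the stricter "error out" behaviour instead.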
Thanks for the tip