One-Hot Encoding
In section '6.1.1 One-hot encoding of words and characters' (as well as section '3.4.2 Preparing the data'), the encoding produced by 'Listing 6.3 Using Keras for word-level one-hot encoding' does not appear to be one-hot encoding as described in Listings 6.1 and 6.2.
The code from Listing 6.1 produces this encoding:
array([[[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

       [[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]])
The code from Listing 6.3 produces this encoding:
array([[0., 1., 1., 1., 1., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 1., 1., 1., 1.]])
Why are they different?
Thanks!
Do you mean https://keras.io/preprocessing/text/#text_to_word_sequence vs. doing it manually?
Yes! The output of manually one-hot encoding a sequence (Listings 6.1 and 6.2) is one vector per word in the sequence, with a single hot entry (i.e., a 1) in each vector, which is how the book describes one-hot encoding.
array([[[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]])
The output of the Keras Tokenizer (Listing 6.3), on the other hand, is a single vector for each sequence, and that vector has more than one hot entry.
array([[0., 1., 1., 1., 1., 1., 0., 0., 0., 0.]])
I don't understand why Listing 6.3 produces an output that does not appear to be one-hot encoding. It looks to me like Listing 6.3 produces a vector that indicates the presence of words in a sequence, but loses the order of the words in that sequence.
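For reference, here is a minimal sketch of the kind of code that reproduces the matrix above (not the book's exact Listing 6.3: I'm assuming the two sample sentences, and I leave out the num_words argument so the printed matrix stays small):

from keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']   # assumed samples

tokenizer = Tokenizer()      # the book's listing passes num_words; omitted here so the matrix has only 10 columns
tokenizer.fit_on_texts(samples)
results = tokenizer.texts_to_matrix(samples, mode='binary')

print(results.shape)   # (2, 10): one row per sequence, not one matrix per sequence
print(results[0])      # [0. 1. 1. 1. 1. 1. 0. 0. 0. 0.] -> several hot entries, word order is gone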
Thank you for responding, morenoh149!
The to_categorical() method returns a NumPy array of type int32, so the to_one_hot() method needs to be modified to match that output:

import numpy as np

def to_one_hot(labels, dimension=46):
    results = np.zeros((len(labels), dimension)).astype(np.int32)
    for i, label in enumerate(labels):
        results[i, label] = 1
    return results
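A quick way to check the two against each other (just a sketch; the labels array is an arbitrary example, and the dtype that to_categorical() returns may depend on your Keras version):

from keras.utils import to_categorical

labels = np.array([0, 3, 45])     # a few example label indices (assumed)
a = to_categorical(labels, 46)
b = to_one_hot(labels, 46)        # the modified function above
print(a.dtype, b.dtype)           # compare the dtypes
print(np.array_equal(a, b))       # True -> the two encodings agree element-wise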
I think the code in section 3.4.2 (Listing 3.2) does not produce one-hot-encoded data. In NLP there is a general understanding that one-hot encoding maps a word to a vector of size n, where n is the size of the vocabulary: all values in the vector are zero except one, which is 1, its index position pointing to the word.* A one-hot-encoded text is then represented as a matrix, where each row (or column) is a one-hot-encoded word. In the book DLwP and in the text preprocessing module in Keras, one-hot encoding refers to something different. Listing 3.2 creates, as I understand it, a binarized bag-of-words representation of the text: the text is represented by a single vector instead of a matrix, and the order of the words is obviously lost.
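To make the difference concrete, here is a small sketch (my own illustration, with a tiny vocabulary of 6 words instead of the book's 10,000):

import numpy as np

sequence = [2, 5, 2, 3]     # a text as a list of word indices, e.g. 'cat mat cat sat' (assumed)
vocab_size = 6

# What Listing 3.2 does (binarized bag of words): one vector per text.
bag = np.zeros((vocab_size,))
bag[sequence] = 1.
print(bag)                  # [0. 0. 1. 1. 0. 1.] -> word order and word counts are gone

# One-hot encoding in the usual NLP sense: one vector per word, i.e. a matrix per text.
one_hot_matrix = np.zeros((len(sequence), vocab_size))
for j, index in enumerate(sequence):
    one_hot_matrix[j, index] = 1.
print(one_hot_matrix)       # 4 rows, each with a single 1; word order is preserved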
Keras's understanding of one-hot encoding is different again, and also not the general understanding outlined above: keras.preprocessing.text.one_hot() returns a vector for a text in which each word is replaced by an integer index for that word, so if your vocabulary has a size of 10,000, that integer can be as large as the vocabulary size.
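For example (a sketch; the exact integers depend on the hashing, and two different words can collide):

from keras.preprocessing.text import one_hot

print(one_hot('The cat sat on the mat', n=10000))
# something like [7239, 4521, 883, 6045, 7239, 1329]: one integer per word, not one-hot vectors
# ('The' and 'the' get the same integer because the text is lowercased by default)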
If you want to one-hot encode with Keras in the general sense, you can use this combination: hashing_trick() substitutes each word with an integer (n = size of the vocabulary), and to_categorical() then creates the one-hot encoding.
import keras

t = keras.preprocessing.text.hashing_trick(text, n)       # one integer index per word, in the range [1, n)
one_hot_results = keras.utils.to_categorical(t)            # turns the integer indices into one-hot vectors
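For example, continuing from those two lines with text = 'The cat sat on the mat' and n = 10000 (both assumed):

print(one_hot_results.shape)          # (6, k): one row per word; k is max(t) + 1 unless you pass num_classes=n to to_categorical()
print(one_hot_results.sum(axis=1))    # [1. 1. 1. 1. 1. 1.] -> exactly one hot entry per row

One caveat: hashing_trick() can map two different words to the same index, so this is one-hot encoding only up to hash collisions.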
*See the Wikipedia article on 'one-hot' or the lecture notes from R. Socher's NLP course: https://cs224d.stanford.edu/lecture_notes/LectureNotes1.pdf