
[IDEA] Use of TextVectorization layer for sentiment analysis in chapter 16

Open eisthf opened this issue 4 years ago • 1 comment

The following code snippet from chapter 16 illustrates how to preprocess and encode the texts and use the result to train a GRU model for sentiment analysis.

from tensorflow import keras

embed_size = 128
model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size,
                           mask_zero=True, # not shown in the book
                           input_shape=[None]),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.GRU(128),
    keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
history = model.fit(train_set, steps_per_epoch=train_size // 32, epochs=5)
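
(For reference, vocab_size, num_oov_buckets, train_set and train_size are defined in the earlier cells of the notebook, where each review is tokenized and every word is mapped to an integer ID with a lookup table. Roughly, paraphrased from the notebook rather than copied verbatim:)

import tensorflow as tf

# Rough sketch of the earlier notebook cells (paraphrased, not verbatim):
# the vocab_size most frequent words are mapped to integer IDs with a
# StaticVocabularyTable, with extra buckets for out-of-vocabulary words.
words = tf.constant(truncated_vocabulary)
word_ids = tf.range(len(truncated_vocabulary), dtype=tf.int64)
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
num_oov_buckets = 1000
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)

def encode_words(X_batch, y_batch):
    return table.lookup(X_batch), y_batch

train_set = datasets["train"].repeat().batch(32).map(preprocess)
train_set = train_set.map(encode_words).prefetch(1)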

It would be great to also show how to use keras.layers.experimental.preprocessing.TextVectorization for the same goal. I tried the following code snippet, but the problem is that training becomes far too slow:

from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

vectorize_layer = TextVectorization(max_tokens=vocab_size+1, output_mode='int')
vectorize_layer.adapt(truncated_vocabulary)

embed_size = 128
model = keras.models.Sequential([
    vectorize_layer,
    keras.layers.Embedding(vocab_size + 1, embed_size,
                           mask_zero=True, # not shown in the book
                           input_shape=[None]),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.GRU(128),
    keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
history = model.fit(datasets["train"].batch(32).prefetch(1), steps_per_epoch=train_size // 32, epochs=5)

Any advice please?

eisthf avatar Jun 06 '21 04:06 eisthf

Thanks for your suggestion @eisthf. Preprocessing layers did not exist yet when I wrote the second edition; they were just rough specs, so I had to do my best to guess what they would look like in the end (with the kind help of François Chollet, though). That's why this section of the book is not very detailed. But I plan to go into more depth in the 3rd edition (coming out at the end of 2022).

Your code is pretty good, I wouldn't change much. Just this:

  • I would pass ragged=True when constructing the TextVectorization layer. This avoids the need for masking, which simplifies things, and it may also speed things up a little bit.
  • When calling adapt(), instead of passing it the truncated vocabulary, I would give it a large enough sample of the training set so that it can figure out what the vocabulary should be on its own. But I'd give it only the text, not the labels. So it would look like this:
    vectorize_layer.adapt(datasets["train"].map(lambda X, y: X).batch(32).take(10000))
    
    In this case, 10,000 batches of 32 samples each is larger than the training set, but if the dataset was huge, we would be happy to avoid going through all of it just to construct the vocabulary. A sample of 10,000 batches should be enough to capture most of the important words.
  • I would remove the steps_per_epoch=train_size // 32 argument. It was just a workaround needed in TF 2.0, because model.fit() had issues with the way it handled dataset shuffling across epochs (it would repeat the same order at every epoch, so it was preferable to make the dataset infinite using repeat() and tell fit() the number of steps per epoch). This has since been fixed, so the argument is no longer needed; the old workaround looked like the sketch just below this list.
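
For reference, the old workaround looked roughly like this (just a sketch, reusing the variables from your snippet; the variable name infinite_train_set is mine):

# Old TF 2.0 workaround (no longer needed): make the dataset infinite with
# repeat() and tell fit() how many batches make up one epoch.
infinite_train_set = datasets["train"].batch(32).prefetch(1).repeat()
history = model.fit(infinite_train_set, steps_per_epoch=train_size // 32, epochs=5)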

Here's the code:

vectorize_layer = keras.layers.TextVectorization(max_tokens=vocab_size+1, output_mode='int', ragged=True)
vectorize_layer.adapt(datasets["train"].map(lambda X, y: X).batch(32).take(10000))

embed_size = 128
model = keras.models.Sequential([
    vectorize_layer,
    keras.layers.Embedding(vocab_size + 1, embed_size),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.GRU(128),
    keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
history = model.fit(datasets["train"].batch(32).prefetch(1), epochs=5)
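
As a quick sanity check (toy strings, just for illustration), you can verify that the layer now outputs a tf.RaggedTensor, so there is no padding and no need for mask_zero:

# Toy example (illustrative only): with ragged=True the layer returns a
# tf.RaggedTensor, so each row keeps its own length and no mask is needed.
sample_reviews = tf.constant(["this movie was great", "terrible"])
print(vectorize_layer(sample_reviews))
# e.g., <tf.RaggedTensor [[12, 19, 14, 87], [276]]>  (actual IDs depend on the learned vocabulary)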

That said, I just tested it, and indeed it looks like it's an order of magnitude slower than using the preprocess() function, which preprocesses the text using tf.strings.substr(), tf.strings.regex_replace() and tf.strings.split(). It may just be because TextVectorization is much more sophisticated, but it does seem a bit surprising.
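
For reference, the preprocess() function I'm referring to is the one from the chapter 16 notebook; it looks roughly like this (reproduced from memory, so check the notebook for the exact version):

# Roughly the notebook's preprocess() function (from memory): crop each review,
# strip the <br /> tags and non-letter characters, then split into words and pad.
def preprocess(X_batch, y_batch):
    X_batch = tf.strings.substr(X_batch, 0, 300)
    X_batch = tf.strings.regex_replace(X_batch, rb"<br\s*/?>", b" ")
    X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z']", b" ")
    X_batch = tf.strings.split(X_batch)
    return X_batch.to_tensor(default_value=b"<pad>"), y_batch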

I'll do a bit of digging on this. And of course if you find something, I'd love to know.

ageron avatar Oct 07 '21 10:10 ageron