Tokenization and Text Data Preparation with TensorFlow & Keras

Open khuyentran1401 opened this issue 4 years ago • 0 comments

TL;DR

This article looks at how to tokenize and prepare text data with the TensorFlow and Keras preprocessing tools.

Article Link

https://www.kdnuggets.com/2020/03/tensorflow-keras-tokenization-text-data-prep.html

Author

Matthew Mayo

Key Takeaways

Useful Code Snippets

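from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Example settings (illustrative values, not taken from this issue)
num_words = 1000       # cap on the vocabulary size used when encoding
oov_token = '<UNK>'    # placeholder for out-of-vocabulary words
pad_type = 'post'      # add padding at the end of each sequence
trunc_type = 'post'    # cut overly long sequences from the end
train_data = ["I enjoy coffee.", "I enjoy tea.", "I dislike milk."]

# Tokenize our training data
tokenizer = Tokenizer(num_words=num_words, oov_token=oov_token)
tokenizer.fit_on_texts(train_data)

# Get our training data word index (word -> integer ID)
word_index = tokenizer.word_index

# Encode training data sentences into sequences of integer IDs
train_sequences = tokenizer.texts_to_sequences(train_data)

# Get max training sequence length
maxlen = max(len(x) for x in train_sequences)

# Pad the training sequences so they all have length maxlen
train_padded = pad_sequences(train_sequences, padding=pad_type, truncating=trunc_type, maxlen=maxlen)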
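To make the steps in the snippet above concrete, here is a rough pure-Python sketch of what fitting a word index, encoding sentences, and padding amount to. It is a simplified illustration, not the Keras implementation: the helper names (`fit_word_index`, `encode`, `pad`) and the sample sentences are hypothetical, the regex-based word splitting is an assumption that only approximates Keras's filtering, and the `num_words` cap is left out for brevity.

```python
import re
from collections import Counter


def split_words(text):
    # Assumption: lowercase and keep runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def fit_word_index(texts, oov_token="<UNK>"):
    # Most frequent words get the smallest IDs. ID 0 is left free for padding
    # and ID 1 is given to the out-of-vocabulary token.
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    word_index = {oov_token: 1}
    for word_id, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = word_id
    return word_index


def encode(texts, word_index, oov_token="<UNK>"):
    # Unknown words are mapped to the OOV token's ID.
    oov_id = word_index[oov_token]
    return [[word_index.get(w, oov_id) for w in split_words(t)] for t in texts]


def pad(sequences, maxlen, padding="post", truncating="post", value=0):
    # Truncate sequences longer than maxlen, then fill shorter ones with `value`.
    padded = []
    for seq in sequences:
        if len(seq) > maxlen:
            seq = seq[:maxlen] if truncating == "post" else seq[-maxlen:]
        fill = [value] * (maxlen - len(seq))
        padded.append(seq + fill if padding == "post" else fill + seq)
    return padded


# Hypothetical sample data for illustration
sample_train = ["I enjoy coffee.", "I enjoy tea.", "I dislike milk."]
sample_index = fit_word_index(sample_train)
sample_sequences = encode(sample_train, sample_index)
sample_maxlen = max(len(s) for s in sample_sequences)
sample_padded = pad(sample_sequences, sample_maxlen)

# Unseen text reuses the index fitted on the training data
sample_test = pad(encode(["I enjoy juice today"], sample_index), sample_maxlen)
```

Fitting the index on the training data only and reusing it for new text is the point of the `oov_token` setting: words that were never seen during fitting still receive a valid ID instead of being dropped.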

Useful Tools

  • Keras
  • TensorFlow

Comments/Questions

khuyentran1401 avatar Apr 10 '20 18:04 khuyentran1401