Tokenizer converts padding integers to OOV when oov_token is not None

Open meliksahturker opened this issue 4 years ago • 2 comments

System information.

  • TensorFlow version (you are using): 2.4.1 (also reproduced in 2.6)
  • Are you willing to contribute it (Yes/No): No

Describe the feature and the current behavior/state.

When a sequence has been padded, Tokenizer.sequences_to_texts() decodes the padding integers as oov_token whenever oov_token is not None. When oov_token is None this does not happen: sequences_to_texts() skips padding integers along with any other out-of-vocabulary integers.

This behaviour is perhaps expected, since the padding value is not part of the vocabulary. However, I think it would make more sense for sequences_to_texts() to take an optional padding_value argument and not decode those integers as oov_token.
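The behaviour described above can be modelled in a few lines of plain Python. This is only a simplified sketch of the lookup logic as reported in this issue, not the Keras source; the `decode` function and the `index_word` dictionary are hypothetical names made up for illustration.

```python
def decode(sequence, index_word, oov_token=None):
    """Simplified model of the decoding step (an assumption, not the Keras code)."""
    words = []
    for index in sequence:
        word = index_word.get(index)
        if word is not None:
            words.append(word)
        elif oov_token is not None:
            # Any index missing from the vocabulary, including padding 0,
            # falls back to the OOV token.
            words.append(oov_token)
        # With oov_token=None, unknown indices are silently skipped.
    return " ".join(words)


# Hypothetical vocabulary matching the indices in the reproduction script.
index_word = {1: "<OOV>", 2: "hello", 3: "world", 4: "test"}
padded = [0, 0, 2, 3, 4]

with_oov = decode(padded, index_word, oov_token="<OOV>")
without_oov = decode(padded, index_word, oov_token=None)

print(with_oov)     # <OOV> <OOV> hello world test
print(without_oov)  # hello world test
```

In this model the padding value is indistinguishable from any other unknown index, which is why it ends up rendered as the OOV token.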

To reproduce:

import tensorflow as tf

vocab_size = 5
seq_len = 5

text = "hello world test"
oov_token = "<OOV>"

tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words = vocab_size, oov_token = oov_token)
tokenizer.fit_on_texts([text])

tokenized = tokenizer.texts_to_sequences([text])
padded = tf.keras.preprocessing.sequence.pad_sequences(tokenized, maxlen = seq_len, value = 0)

print('Non padded tokenization result:', tokenized)
print("Non padded de-tokenization result:", tokenizer.sequences_to_texts(tokenized))
print("\n")
print('Padded tokenization result:', padded)
print("Padded de-tokenization result:", tokenizer.sequences_to_texts(padded))

Output:

Non padded tokenization result: [[2, 3, 4]]
Non padded de-tokenization result: ['hello world test']

Padded tokenization result: [[0 0 2 3 4]]
Padded de-tokenization result: ['<OOV> <OOV> hello world test']

Expected padded de-tokenization result with this feature implemented: ['hello world test']

Will this change the current API? How?

The tf.keras.preprocessing.text.Tokenizer.sequences_to_texts() function would take an optional padding_value argument, which is None by default.

Who will benefit from this feature?

Those who use tf.keras.preprocessing.text.Tokenizer to tokenize strings with padding and then de-tokenize the padded sequences back to words.
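Until such an argument exists, one possible workaround is to strip the padding value before decoding. The sketch below assumes only that the tokenizer exposes sequences_to_texts(); the helper name `sequences_to_texts_skip_padding` is hypothetical, and `StandInTokenizer` is a made-up stand-in that mimics the reported behaviour so the example runs without TensorFlow.

```python
def sequences_to_texts_skip_padding(tokenizer, sequences, padding_value=0):
    """Drop padding_value from every sequence, then delegate to the tokenizer.

    Hypothetical helper, not part of the Keras API.
    """
    stripped = [
        [int(index) for index in sequence if index != padding_value]
        for sequence in sequences
    ]
    return tokenizer.sequences_to_texts(stripped)


class StandInTokenizer:
    """Stand-in that mimics the behaviour reported in this issue (an assumption)."""

    def __init__(self, index_word, oov_token=None):
        self.index_word = index_word
        self.oov_token = oov_token

    def sequences_to_texts(self, sequences):
        texts = []
        for sequence in sequences:
            words = []
            for index in sequence:
                word = self.index_word.get(index)
                if word is not None:
                    words.append(word)
                elif self.oov_token is not None:
                    words.append(self.oov_token)
            texts.append(" ".join(words))
        return texts


tokenizer = StandInTokenizer(
    {1: "<OOV>", 2: "hello", 3: "world", 4: "test"}, oov_token="<OOV>"
)
padded = [[0, 0, 2, 3, 4]]

plain = tokenizer.sequences_to_texts(padded)
skipped = sequences_to_texts_skip_padding(tokenizer, padded, padding_value=0)

print(plain)    # ['<OOV> <OOV> hello world test']
print(skipped)  # ['hello world test']
```

The helper removes every occurrence of padding_value, not only leading or trailing ones. That should be harmless for the default value 0, which Tokenizer reserves and never assigns to a word, but it would be wrong for a padding value that is also a real word index.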

Contributing

  • Do you want to contribute a PR? (yes/no): No

meliksahturker, Aug 26 '21 16:08

@ymodak I was able to reproduce the issue in TF v2.6, TF v2.5, and TF-nightly. Please find the gist here. Thanks!

kumariko, Sep 02 '21 03:09

I was able to reproduce the issue on tf-nightly. Kindly find the gist here.

tilakrayal, May 09 '23 03:05