Tokenizer converts padding integers to OOV when oov_token is not None
System information.
TensorFlow version (you are using): 2.4.1 (also reproduced in 2.6)
Are you willing to contribute it (Yes/No): No
Describe the feature and the current behavior/state.
When used with padding, Tokenizer.sequences_to_texts() converts padding integers to oov_token when oov_token is not None. This does not happen when oov_token = None, in which case sequences_to_texts() skips padding integers as well as OOV integers.
This behaviour is perhaps expected, since the padding value is not part of the vocabulary.
However, I think it would make more sense if sequences_to_texts() took an optional padding_value argument and did not decode these integers as oov_token.
To reproduce:
import tensorflow as tf

vocab_size = 5
seq_len = 5
text = "hello world test"
oov_token = "<OOV>"

# Fit a tokenizer with an OOV token and convert the text to integer sequences.
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=vocab_size, oov_token=oov_token)
tokenizer.fit_on_texts([text])
tokenized = tokenizer.texts_to_sequences([text])

# Pre-pad the sequences with 0, which is not part of the vocabulary.
padded = tf.keras.preprocessing.sequence.pad_sequences(tokenized, maxlen=seq_len, value=0)

print('Non padded tokenization result:', tokenized)
print("Non padded de-tokenization result:", tokenizer.sequences_to_texts(tokenized))
print("\n")
print('Padded tokenization result:', padded)
print("Padded de-tokenization result:", tokenizer.sequences_to_texts(padded))
Non padded tokenization result: [[2, 3, 4]]
Non padded de-tokenization result: ['hello world test']
Padded tokenization result: [[0 0 2 3 4]]
Padded de-tokenization result: ['<OOV> <OOV> hello world test']
What it would de-tokenize to with this feature implemented:
Feature implemented padded de-tokenization result: ['hello world test']
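For reference, a minimal workaround sketch under the current API, assuming the padding value 0 and the tokenizer and padded variables from the snippet above, is to strip the padding entries manually before decoding:

padding_value = 0  # same value passed to pad_sequences above
unpadded = [[int(i) for i in seq if int(i) != padding_value] for seq in padded]
print("Workaround de-tokenization result:", tokenizer.sequences_to_texts(unpadded))
# -> ['hello world test']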
Will this change the current API? How?
The tf.keras.preprocessing.text.Tokenizer.sequences_to_texts() function would take an optional padding_value argument, which defaults to None.
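A rough sketch of the proposed semantics, written as a subclass purely for illustration (the name PaddingAwareTokenizer and the filtering logic are my assumptions, not the actual Keras implementation):

from tensorflow.keras.preprocessing.text import Tokenizer

class PaddingAwareTokenizer(Tokenizer):
    # Illustrative only: drop entries equal to padding_value before decoding,
    # so they are neither emitted as words nor mapped to oov_token.
    def sequences_to_texts(self, sequences, padding_value=None):
        if padding_value is not None:
            sequences = [[int(i) for i in seq if int(i) != padding_value]
                         for seq in sequences]
        return super().sequences_to_texts(sequences)

With such an argument, tokenizer.sequences_to_texts(padded, padding_value=0) would return ['hello world test'] for the example above, while the default padding_value=None would keep the current behaviour.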
Who will benefit from this feature?
Those who use tf.keras.preprocessing.text.Tokenizer to tokenize strings with padding and de-tokenize the padded sequences back to words.
- Do you want to contribute a PR? (yes/no): No
@ymodak Was able to reproduce the issue in TF v2.6, TF v2.5, and TF-nightly. Please find the gist here. Thanks!
I was able to reproduce the issue on tf-nightly. Kindly find the gist of it here.