Is `preserve_unused_token` working when calling `bert_vocab_from_dataset`?
If I understand it correctly, this call:
```python
import tensorflow as tf
from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab

bert_vocab.bert_vocab_from_dataset(
    dataset=tf.data.Dataset.from_tensor_slices(['I am [unused1].']),
    vocab_size=100,
    reserved_tokens=[],
    bert_tokenizer_params=dict(lower_case=False, preserve_unused_token=True),
    learn_params=None,
)
```
should return [unused1] as a single token, but instead I get
```python
['.', '1', 'I', '[', ']', 'a', 'd', 'e', 'm', 'n', 's', 'u', '##.', '##1', '##I', '##[', '##]', '##a', '##d', '##e', '##m', '##n', '##s', '##u']
```
which is exactly what would be expected if `preserve_unused_token` were being ignored.
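For reference, this is the behavior I expect from `preserve_unused_token` when tokenizing with `BertTokenizer` directly (a minimal sketch; the toy vocabulary here is hypothetical and only for illustration):

```python
import tensorflow as tf
import tensorflow_text as tf_text

# Hypothetical toy vocabulary, only for illustration.
vocab = ['[UNK]', '[unused1]', 'I', 'am', '.']
table = tf.lookup.StaticVocabularyTable(
    tf.lookup.KeyValueTensorInitializer(
        keys=vocab, values=tf.range(len(vocab), dtype=tf.int64)),
    num_oov_buckets=1)

# With preserve_unused_token=True, [unused1] should survive basic
# tokenization as a single token instead of being split on '[' and ']'.
tokenizer = tf_text.BertTokenizer(
    table, lower_case=False, preserve_unused_token=True)
print(tokenizer.tokenize(['I am [unused1].']))
# Expected: the ids of 'I', 'am', '[unused1]', '.'
```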
Am I doing something wrong?
It looks like you are using it correctly, and this may be a bug. We'll take a look.
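In the meantime, one possible workaround (an untested sketch, assuming you only need `[unused1]` to end up in the generated vocabulary) is to pass it through `reserved_tokens`, which are copied into the output vocabulary verbatim:

```python
import tensorflow as tf
from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab

# Reserved tokens are prepended to the generated vocabulary as-is,
# so [unused1] is guaranteed to appear even if preserve_unused_token
# is being ignored during vocabulary learning.
vocab = bert_vocab.bert_vocab_from_dataset(
    dataset=tf.data.Dataset.from_tensor_slices(['I am [unused1].']),
    vocab_size=100,
    reserved_tokens=['[unused1]'],
    bert_tokenizer_params=dict(lower_case=False, preserve_unused_token=True),
)
```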