Is `preserve_unused_token` working when calling `bert_vocab_from_dataset`?
If I understand it correctly, this call:
```python
import tensorflow as tf
from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab

bert_vocab.bert_vocab_from_dataset(
    dataset=tf.data.Dataset.from_tensor_slices(['I am [unused1].']),
    vocab_size=100,
    reserved_tokens=[],
    bert_tokenizer_params=dict(lower_case=False, preserve_unused_token=True),
    learn_params=None,
)
```
should return [unused1] as a single token, but instead I get
```python
['.', '1', 'I', '[', ']', 'a', 'd', 'e', 'm', 'n', 's', 'u', '##.', '##1', '##I', '##[', '##]', '##a', '##d', '##e', '##m', '##n', '##s', '##u']
```
which is exactly what would be expected if `preserve_unused_token` were being ignored.
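For reference, this is the behavior I expect from `preserve_unused_token` when tokenizing with `BertTokenizer` directly (a minimal sketch; the toy vocabulary here is hypothetical and only for illustration):

```python
import tensorflow as tf
import tensorflow_text as tf_text

# Hypothetical toy vocabulary, only for illustration.
vocab = ['[UNK]', '[unused1]', 'I', 'am', '.']
table = tf.lookup.StaticVocabularyTable(
    tf.lookup.KeyValueTensorInitializer(
        keys=vocab, values=tf.range(len(vocab), dtype=tf.int64)),
    num_oov_buckets=1)

# With preserve_unused_token=True, [unused1] should survive basic
# tokenization as a single token instead of being split on '[' and ']'.
tokenizer = tf_text.BertTokenizer(
    table, lower_case=False, preserve_unused_token=True)
print(tokenizer.tokenize(['I am [unused1].']))
# Expected: the ids of 'I', 'am', '[unused1]', '.'
```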
Am I doing something wrong?
It looks like you are using it correctly, and this may be a bug. We'll take a look.
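In the meantime, one possible workaround (an untested sketch, assuming you only need `[unused1]` to end up in the generated vocabulary) is to pass it through `reserved_tokens`, which are copied into the output vocabulary verbatim:

```python
import tensorflow as tf
from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab

# Reserved tokens are prepended to the generated vocabulary as-is,
# so [unused1] is guaranteed to appear even if preserve_unused_token
# is being ignored during vocabulary learning.
vocab = bert_vocab.bert_vocab_from_dataset(
    dataset=tf.data.Dataset.from_tensor_slices(['I am [unused1].']),
    vocab_size=100,
    reserved_tokens=['[unused1]'],
    bert_tokenizer_params=dict(lower_case=False, preserve_unused_token=True),
)
```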