
No tokenizer option to add special tokens (`[MASK]`, `[PAD]`) inside string input

Open admiraltidebringer opened this issue 1 year ago • 10 comments

**Describe the bug**

`BertTokenizer` can't tokenize the `[MASK]` token. It should return `103`, but it returns `1031, 7308, 1033`, i.e. `[MASK]` gets split into the three subword tokens `[`, `mask`, `]`.

**Proof**

keras_nlp library:

```python
keras_nlp.models.BertTokenizer.from_preset('bert_tiny_en_uncased', sequence_length=12)(
    ['i am going to [MASK] to study math', 'the day before we went to [MASK] to cure illness']
)
```

result:

```
<tf.Tensor: shape=(2, 12), dtype=int32, numpy=
array([[1045, 2572, 2183, 2000, 1031, 7308, 1033, 2000, 2817, 8785,    0,    0],
       [1996, 2154, 2077, 2057, 2253, 2000, 1031, 7308, 1033, 2000, 9526, 7355]], dtype=int32)>
```

Hugging Face:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Serdarmuhammet/bert-base-banking77")
tokenizer(['i am going to [MASK] to study math', 'the day before we went to [MASK] to cure illness'],
          return_tensors='tf', max_length=20, padding=True)['input_ids']
```

result:

```
<tf.Tensor: shape=(2, 12), dtype=int32, numpy=
array([[ 101, 1045, 2572, 2183, 2000,  103, 2000, 2817, 8785,  102,    0,    0],
       [ 101, 1996, 2154, 2077, 2057, 2253, 2000,  103, 2000, 9526, 7355,  102]], dtype=int32)>
```
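The mask id is present in the preset vocabulary, so until the tokenizer grows an option to treat `[MASK]` / `[PAD]` as unsplittable special tokens, one possible workaround is to splice the id in manually. The sketch below is only a rough workaround, not an official keras_nlp API: it assumes the tokenizer exposes `token_to_id` for vocabulary lookups, and `encode_with_mask` is a hypothetical helper that splits the input on the literal `[MASK]` marker.

```python
# Workaround sketch: tokenize around the literal "[MASK]" markers and splice in
# the real mask id. `encode_with_mask` is a hypothetical helper, not a library API.
import keras_nlp

tokenizer = keras_nlp.models.BertTokenizer.from_preset("bert_tiny_en_uncased")
mask_id = tokenizer.token_to_id("[MASK]")  # expected to be 103 for the uncased vocab
pad_id = tokenizer.token_to_id("[PAD]")    # expected to be 0

def encode_with_mask(text, sequence_length=12):
    """Tokenize `text`, mapping every literal "[MASK]" marker to the mask token id."""
    ids = []
    for i, piece in enumerate(text.split("[MASK]")):
        if i > 0:
            ids.append(mask_id)  # one mask id per "[MASK]" marker
        piece = piece.strip()
        if piece:
            ids.extend(tokenizer(piece).numpy().tolist())
    ids = ids[:sequence_length]
    ids += [pad_id] * (sequence_length - len(ids))  # pad to fixed length
    return ids

print(encode_with_mask("i am going to [MASK] to study math"))
# expected (assuming the ids above): [1045, 2572, 2183, 2000, 103, 2000, 2817, 8785, 0, 0, 0, 0]
```

This only patches over the symptom; the cleaner fix the title asks for is a tokenizer option to recognize special tokens like `[MASK]` and `[PAD]` inside string input.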

admiraltidebringer · Jan 08 '24 12:01