keras-nlp
No tokenizer option to add special tokens (`[MASK]`, `[PAD]`) inside string input
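Describe the bug

`BertTokenizer` does not recognize the `[MASK]` token inside a string input. It should return the single ID 103, but it returns three IDs instead: 1031, 7308, 1033. These appear to be the IDs of the separate pieces `[`, `mask`, and `]`.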
Proof

keras_nlp library:

    import keras_nlp

    tokenizer = keras_nlp.models.BertTokenizer.from_preset('bert_tiny_en_uncased', sequence_length=12)
    tokenizer(['i am going to [MASK] to study math',
               'the day before we went to [MASK] to cure illness'])

result:

    <tf.Tensor: shape=(2, 12), dtype=int32, numpy=
    array([[1045, 2572, 2183, 2000, 1031, 7308, 1033, 2000, 2817, 8785,    0,    0],
           [1996, 2154, 2077, 2057, 2253, 2000, 1031, 7308, 1033, 2000, 9526, 7355]], dtype=int32)>
Hugging Face:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Serdarmuhammet/bert-base-banking77")
    tokenizer(['i am going to [MASK] to study math',
               'the day before we went to [MASK] to cure illness'],
              return_tensors='tf', max_length=20, padding=True)['input_ids']

result:

    <tf.Tensor: shape=(2, 12), dtype=int32, numpy=
    array([[ 101, 1045, 2572, 2183, 2000,  103, 2000, 2817, 8785,  102,    0,    0],
           [ 101, 1996, 2154, 2077, 2057, 2253, 2000,  103, 2000, 9526, 7355,  102]], dtype=int32)>
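To make the expected behaviour concrete, here is a minimal plain-Python sketch of one way special tokens could be kept intact: split the input on the special-token strings first, map those directly to their IDs, and tokenize only the text in between. This is not how keras_nlp or Hugging Face implement it. `toy_word_ids`, `TOY_VOCAB`, and `tokenize_keeping_specials` are hypothetical names made up for illustration, and the IDs are simply read off the two outputs above.

```python
import re

# Special-token IDs as they appear in the outputs above
# (0 is the padding ID, 103 is the expected [MASK] ID).
SPECIAL_IDS = {"[PAD]": 0, "[MASK]": 103}

# Toy vocabulary (assumption: IDs read off the outputs quoted in this report).
TOY_VOCAB = {
    "i": 1045, "am": 2572, "going": 2183, "to": 2000,
    "study": 2817, "math": 8785,
    "[": 1031, "mask": 7308, "]": 1033,
}


def toy_word_ids(text):
    """Hypothetical stand-in for a real tokenizer.

    Lowercases the text and splits it into words and single punctuation
    characters, which is why "[MASK]" falls apart into "[", "mask", "]".
    """
    pieces = re.findall(r"\w+|[^\w\s]", text.lower())
    return [TOY_VOCAB[piece] for piece in pieces]


def tokenize_keeping_specials(text, tokenize=toy_word_ids, special_ids=SPECIAL_IDS):
    """Split on special tokens first, then tokenize only the remaining text."""
    # The capturing group makes re.split keep the special tokens in the result.
    pattern = "(" + "|".join(re.escape(tok) for tok in special_ids) + ")"
    ids = []
    for chunk in re.split(pattern, text):
        if chunk in special_ids:
            ids.append(special_ids[chunk])   # special token: one ID
        elif chunk.strip():
            ids.extend(tokenize(chunk))      # ordinary text: normal tokenization
    return ids


sentence = "i am going to [MASK] to study math"
print(toy_word_ids(sentence))               # [MASK] is split into three IDs
print(tokenize_keeping_specials(sentence))  # [MASK] is kept as the single ID 103
```

With the toy vocabulary, the first call reproduces the three-ID split reported above, and the second yields 103 at the `[MASK]` position, which matches the Hugging Face output apart from the added 101 and 102 tokens and the padding.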