Problem adding a token with a specific Replace normalizer
Hi @n1t0 ,
I want to represent numbers with the token [NUM]. A specific regex normalizer which keeps only alphanumeric characters causes the tokenizer not to identify this token. This is my code:
from tokenizers import Tokenizer, Regex, AddedToken
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer
from tokenizers.pre_tokenizers import Metaspace
from tokenizers import normalizers
import pandas as pd
t = Tokenizer(Unigram())
t.normalizer = normalizers.Sequence([
    normalizers.Replace(Regex('[^0-9a-zA-Z]'), ' '),  # I mean this normalizer
    normalizers.Replace(Regex('[0-9]+'), '[NUM]'),
])
t.pre_tokenizer = Metaspace()
trainer = UnigramTrainer(special_tokens=['[PAD]', '[EOS]', '[UNK]'], unk_token='[UNK]')
t.train_from_iterator(['Brown fox jumped over the lazy dog'], trainer=trainer)
t.add_tokens([AddedToken('[NUM]')])
print(t.token_to_id('[NUM]'), t.id_to_token(2))  # 25 [UNK]
encoded = t.encode("Me? I'm 25 years old")
pd.DataFrame({'ids': encoded.ids, 'tokens': encoded.tokens}).T

Doing the whole thing again with the first normalizer commented out results in:
print(t.token_to_id('[NUM]'), t.id_to_token(2))  # 25 [UNK]
pd.DataFrame({'ids': encoded.ids, 'tokens': encoded.tokens}).T

tokenizers library version: 0.12.1
Is there anything wrong with my code, or is this a bug?
I am not sure what your expectations are.
[NUM] seems to be present and correct in both; it's just that the non-alphanumeric symbols ? and ' are converted to spaces, which is what your first normalizer does.
No ?
[NUM] is not identified in the first one. Its id is that of the [UNK] token.
Hi @n1t0 , I wrote a test to reproduce the bug:
#[test]
fn with_replace_normalizer() {
    let mut tokenizer = TokenizerBuilder::new()
        .with_model(Unigram::default())
        .with_normalizer(Some(Sequence::new(vec![
            Replace::new(ReplacePattern::Regex(r"[^0-9a-zA-Z]".into()), ' ')
                .unwrap()
                .into(),
            Replace::new(ReplacePattern::Regex(r"[0-9]+".into()), "[NUM]")
                .unwrap()
                .into(),
        ])))
        .with_pre_tokenizer(Some(Metaspace::new('▁', false)))
        .with_post_processor(Some(ByteLevel::default()))
        .with_decoder(Some(ByteLevel::default()))
        .build()
        .unwrap();
    let mut trainer = UnigramTrainerBuilder::default()
        .show_progress(false)
        .unk_token(Some("[UNK]".into()))
        .build()
        .unwrap();
    tokenizer.add_tokens(&[AddedToken::from("[NUM]", false)]);
    tokenizer
        .train(&mut trainer, ["Brown fox jumped over the lazy dog"].iter())
        .expect("Error training the tokenizer");
    assert_eq!(
        tokenizer.encode("25", false).unwrap().get_ids()[0],
        tokenizer.token_to_id("[NUM]").unwrap()
    );
}
Output:
failures:
---- with_replace_normalizer stdout ----
thread 'with_replace_normalizer' panicked at 'assertion failed: `(left == right)`
left: `0`,
right: `1`', tests/added_tokens.rs:161:5
But this cannot work.
add_tokens is for special tokens, and their capture happens before any processing of the initial string, so it's NOT going to work for your intent (since the normalizer runs after the added tokens have been processed).
If you remove [NUM] from the added tokens, keep your normalizer, and train on data that contains numbers, it should give you a tokenizer with [NUM] in its vocabulary, and it should work as you expect. The problem here is that there is no number in your example.
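Something like this, as a rough sketch reusing your snippet (the second training sentence with digits is made up):

t = Tokenizer(Unigram())
t.normalizer = normalizers.Sequence([
    normalizers.Replace(Regex('[^0-9a-zA-Z]'), ' '),
    normalizers.Replace(Regex('[0-9]+'), '[NUM]'),
])
t.pre_tokenizer = Metaspace()
trainer = UnigramTrainer(special_tokens=['[PAD]', '[EOS]', '[UNK]'], unk_token='[UNK]')
# Train on text that actually contains digits, so the normalized corpus contains [NUM]
# and the trainer has a chance to learn it as a regular vocabulary entry.
t.train_from_iterator(
    ['Brown fox jumped over the lazy dog', 'She is 25 years old'],
    trainer=trainer,
)
# No add_tokens([...]) call here.
print(t.token_to_id('[NUM]'))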
Is this viable for you ?
But there is a separate add_special_tokens method for Tokenizer as well as AddedVocabulary!
About your solution, I don't think it's a good one, because [NUM] would then be inserted into the model itself, which causes problems in certain scenarios like subword regularization.
I think I found the bug:
Internally, in AddedVocabulary.refresh_added_tokens, classic added tokens (those whose special attribute is False and normalized attribute is True) are inserted into a Trie structure called split_normalized_trie, and special tokens (those whose special attribute is True and normalized attribute is False) are inserted into another Trie called split_trie. Then, upon tokenization, in Tokenizer.encode_single_sequence -> AddedVocabulary.extract_and_normalize, special tokens are first extracted using split_trie, then classic added tokens are extracted from the normalized input sequence using split_normalized_trie. Finally, the main tokenization step is done by the model.
Surprisingly, in AddedVocabulary.refresh_added_tokens, the added tokens are themselves normalized, which does not make sense:
https://github.com/huggingface/tokenizers/blob/61136666243ab9ebc44f5de96c3caf97c2228e51/tokenizers/src/tokenizer/added_vocabulary.rs#L305-L314
Furthermore, according to this sentence, being "normalized" is not about the added tokens themselves but about the input sequence:
https://github.com/huggingface/tokenizers/blob/4ef0afbeb6f0409cc272179059a83f4b00143816/bindings/python/py_src/tokenizers/init.pyi#L28-L32
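A quick way to see the effect with the Python bindings (using the normalizer from my first message):

from tokenizers import Regex, normalizers

norm = normalizers.Sequence([
    normalizers.Replace(Regex('[^0-9a-zA-Z]'), ' '),
    normalizers.Replace(Regex('[0-9]+'), '[NUM]'),
])
# The added token itself gets normalized: its brackets become spaces,
# so what ends up in split_normalized_trie is ' NUM ', not '[NUM]'.
print(repr(norm.normalize_str('[NUM]')))  # ' NUM '
# Meanwhile the input '25' normalizes to '[NUM]', which the trie no longer contains.
print(repr(norm.normalize_str('25')))     # '[NUM]'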
After I removed

if let Some(n) = normalizer {
    n.normalize(&mut content).unwrap();
}

from the code, the test passed.
Hi @sadra-barikbin ,
The test passing is purely due to a lack of test coverage; removing that code is not valid to do (we have a much bigger test suite in transformers which I'm sure would showcase the issues).
I looked more deeply, and the issue lies in the normalizer you are creating.
It first removes [ and ] and then recreates [NUM]. This means [NUM] is impossible to "catch" correctly without breaking the code like you suggest.
Normalization is treated as a single step, and added tokens are captured either before normalization (special tokens, which are special strings that need to be treated before normalization) or after normalization, in which case we normalize the tokens you send us (so [NUM] becomes NUM because of the first Replace).
And then we cannot recognize [NUM], because, well, it's no longer the non-special added token we have.
You can swap the two Replace steps to fix your issue, but NUM will then be the added token and might clash with real text.
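One way to do the swap, sketched with the Python API from your first message (note the token then has to be plain NUM, without brackets):

t.normalizer = normalizers.Sequence([
    normalizers.Replace(Regex('[0-9]+'), 'NUM'),       # digits first
    normalizers.Replace(Regex('[^0-9a-zA-Z]'), ' '),   # then map everything else to spaces
])
# train as before, then:
t.add_tokens([AddedToken('NUM')])  # plain 'NUM' is untouched by both Replace steps
# Caveat: a literal NUM already present in the raw text would now be captured too.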
cause problems in certain scenarios like subword regularization.
I tried to make it work simply, but couldn't.
It's been a very long time since I dived into the Unigram training implementation, but looking at it I figured something was wrong/odd
in the interaction of the suffix tree and the seed generation, preventing the seeds from seeing [NUM] on the simple dummy test you provided.
I made a dummy PR to leave this to further investigation.
The SentencePiece Unigram trainer had a lot more going on that we had to disentangle to fit this library's model, and we could never really achieve 1-1 parity with spm (unfortunately).
This heavily drifts from the original issue, but hopefully it can provide a better path to what you want to achieve.
Would that direction work for you ?
What's the point in normalizing an added token that the user has provided themselves? Added tokens, whether special or classic, are not supposed to be normalized; rather, they should be extracted from the normalized or unnormalized input sequence, depending on the normalized attribute of the token.
Maybe there's something wrong with the Unigram trainer, but IMHO the root cause of this issue is the snippet I pointed to above.
What's the point in normalizing an added token that the user has provided themselves?
So that it's capturable after normalization. That keeps users (and pre-existing tokenizers) from being forced to use normalized tokens, and it works more often than it causes problems (not here, obviously). Let's say the normalization uses Lowercase and you add a token for GOOG (which does appear in your data and is important to you): it would never be captured, and as a user you would be forced to add goog instead.
Since every new token would require understanding how it's going to be normalized, it's a hassle a lot of the time.
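A toy example of what I mean (illustrative vocabulary, WordLevel model just to keep it small, not your Unigram setup):

from tokenizers import Tokenizer, AddedToken, normalizers, pre_tokenizers
from tokenizers.models import WordLevel

t = Tokenizer(WordLevel({'[UNK]': 0, 'buy': 1, 'shares': 2}, unk_token='[UNK]'))
t.normalizer = normalizers.Lowercase()
t.pre_tokenizer = pre_tokenizers.Whitespace()

# The user adds the token exactly as it appears in raw text...
t.add_tokens([AddedToken('GOOG')])
# ...and because the added token is normalized too, the lowercased 'goog'
# in the normalized input is still captured as a single added token.
enc = t.encode('Buy GOOG shares')
print(enc.ids, t.token_to_id('GOOG'))  # the middle id should be the added token's id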
It seems one unstated property of normalization is that it should be idempotent (normalize(normalize(string)) == normalize(string)).
Which is not the case here for you.
For instance, here it wouldn't be able to capture "This is a number [NUM]", because normalization would turn [NUM] into NUM.
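You can check this quickly with your own Sequence (rough sketch, same normalizer as in your first message):

from tokenizers import Regex, normalizers

norm = normalizers.Sequence([
    normalizers.Replace(Regex('[^0-9a-zA-Z]'), ' '),
    normalizers.Replace(Regex('[0-9]+'), '[NUM]'),
])
once = norm.normalize_str('This is the number 23')  # ends with '[NUM]'
twice = norm.normalize_str(once)                    # the brackets are gone again
print(once == twice)  # False: the normalizer is not idempotent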
I hacked a little normalizer that would work:
use crate::normalizers::replace::ReplacePattern;
use crate::normalizers::Replace;
use crate::normalizers::Sequence;
use crate::tokenizer::Normalizer;
use crate::NormalizedString;

#[test]
fn seq_replace() {
    let original = "[x] [NUM] hello 23";
    let normalized = " x [NUM] hello [NUM]";
    let mut n = NormalizedString::from(original);
    let normalizer = Sequence::new(vec![
        // Replace everything that is not alphanumeric or a bracket with a space.
        Replace::new(ReplacePattern::Regex(r"[^a-z0-9A-Z\[\]]".into()), ' ')
            .unwrap()
            .into(),
        // Protect pre-existing [NUM] occurrences...
        Replace::new(ReplacePattern::Regex(r"\[NUM\]".into()), "_NUM_")
            .unwrap()
            .into(),
        // ...drop any remaining brackets...
        Replace::new(ReplacePattern::Regex(r"[\[\]]".into()), " ")
            .unwrap()
            .into(),
        // ...and restore [NUM].
        Replace::new(ReplacePattern::Regex(r"_NUM_".into()), "[NUM]")
            .unwrap()
            .into(),
        // Finally, turn digit runs into [NUM].
        Replace::new(ReplacePattern::Regex(r"[0-9]+".into()), "[NUM]")
            .unwrap()
            .into(),
    ]);
    normalizer.normalize(&mut n).unwrap();
    assert_eq!(&n.get(), &normalized);
}
This is not the best use of regex, but would that work for you ?
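For reference, a rough Python equivalent of that sequence (untested translation of the Rust above):

from tokenizers import Regex, normalizers

n = normalizers.Sequence([
    normalizers.Replace(Regex(r'[^a-z0-9A-Z\[\]]'), ' '),  # keep brackets this time
    normalizers.Replace(Regex(r'\[NUM\]'), '_NUM_'),       # protect existing [NUM]
    normalizers.Replace(Regex(r'[\[\]]'), ' '),            # drop any other brackets
    normalizers.Replace(Regex(r'_NUM_'), '[NUM]'),         # restore [NUM]
    normalizers.Replace(Regex(r'[0-9]+'), '[NUM]'),        # digits -> [NUM]
])
# Stray brackets become spaces; both existing [NUM] and digit runs end up as [NUM].
print(n.normalize_str('[x] [NUM] hello 23'))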
Your point about normalizers having to be idempotent was a good one. But regarding your answer:
You say this behaviour is meant to free the user from the burden of working out the exact content of the token. But when a user adds a token, they are already determining its "exact" content: when I want to capture goog, I add goog, not GOOG. Furthermore, a user doesn't seem to call add_tokens more than a few times in practice.
By the way, this behaviour doesn't seem to be stated anywhere in the documentation.
When I want to capture goog, I add goog, not GOOG
Not really: in financial text you will never encounter goog, always GOOG, and you would have to think about which normalizer you're using (and if you're playing with different normalizations during exploration, you would always have to change the tokens too).
That being said, enabling power users to create these non-special, non-normalized tokens could still be useful (your use case definitely seems very reasonable). Would you be willing to open a PR for this?
By the way, this behaviour doesn't seem to be stated anywhere in the documentation.
You are entirely right; the documentation could benefit from a little help (especially around added tokens). PRs are welcome.
Doesn't Tokenizers have a Discord server (or a channel on the HF server) or a Slack to be able to have fast-paced discussions and ...?