
How to allow the merging of consecutive newline tokens \n when training a byte-level BPE tokenizer?

Open liuslnlp opened this issue 1 year ago • 5 comments

Hello, I'm currently working on training a byte-level BPE tokenizer using the Hugging Face tokenizers library. I've put together a simple training script and a sample corpus, and included the output produced by the script. My aim is to understand why consecutive newline tokens \n are not being merged into a single token \n\n during tokenization. Below are the details:

from tokenizers import (
    Tokenizer,
    pre_tokenizers,
    models,
    decoders,
    trainers,
    processors,
)

files = ["demo_corpus.txt"]
tokenizer = Tokenizer(models.BPE())
# Split digits individually, then apply GPT-2 style byte-level pre-tokenization.
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Digits(individual_digits=True),
    pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=True)
])
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel()

# Seed the vocabulary with the full byte-level alphabet so every byte is covered.
trainer = trainers.BpeTrainer(
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    vocab_size=2000,
    special_tokens=[
        "<pad>", "<|beginoftext|>", "<|endoftext|>"
    ]
)
tokenizer.train(files, trainer)
test_text = "#include <set>\n\n\n\n\n"

print("pre-tokenize spans:", tokenizer.pre_tokenizer.pre_tokenize_str(test_text))
ids = tokenizer.encode(test_text).ids
print(f"tokens: {[tokenizer.decode([tid]) for tid in ids]}")

demo_corpus.txt:

#include <cstdio>

#include <vector>

#include <set>

using namespace std;

int main(){
    int N, A[100000], p = 0;

    multiset<int> S;

    scanf("%d", &N);

    int p0 = 0, q0 = 1, q = N-1;

    vector<int> result;

    for(int i: result)

        printf("%d\n", i);
}

output of the training script:

pre-tokenize spans: [('#', (0, 1)), ('include', (1, 8)), ('Ġ<', (8, 10)), ('set', (10, 13)), ('>', (13, 14)), ('ĊĊĊĊĊ', (14, 19))]
tokens: ['#', 'include', ' <', 'set', '>', '\n', '\n', '\n', '\n', '\n']

The following are the tokens produced by the Llama 3 tokenizer:

from transformers import LlamaTokenizerFast

tokenizer = LlamaTokenizerFast.from_pretrained("my llama3 vocab path")
test_text = "#include <set>\n\n\n\n\n"
print([tokenizer.decode([tid]) for tid in tokenizer(test_text)["input_ids"]])

# output
# ['<|begin_of_text|>', '#include', ' <', 'set', '>\n\n\n\n\n']
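
A quick way to see where the difference comes from is to check whether the freshly trained vocabulary contains any merged newline token at all. A minimal sketch, using the tokenizer object from the training script above ("Ċ" is the byte-level representation of "\n"):

# Sketch: check whether multi-newline tokens made it into the trained vocab.
vocab = tokenizer.get_vocab()
for tok in ["Ċ", "ĊĊ", "ĊĊĊ", "ĊĊĊĊ"]:
    print(repr(tok), "in vocab:", tok in vocab)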

liuslnlp avatar May 18 '24 03:05 liuslnlp

Hi, @Narsil @ArthurZucker I need some help.

liuslnlp avatar May 31 '24 12:05 liuslnlp

Possibly related: https://github.com/meta-llama/llama3/issues/227

josharian avatar Jun 03 '24 22:06 josharian

Hey! That is a good question, will answer in a bit.

ArthurZucker avatar Jun 11 '24 13:06 ArthurZucker

Oops, I think it depends on the content of demo_corpus.txt: if it mostly saw longer runs like \n\n\n\n, it might not have a merge rule for \n and \n alone, while it can have merge rules like \n\n \n and \n\n\n \n. Not super sure, but you can check the merges.txt and manually add the missing entries if you want to force them.
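
A rough sketch of that idea, assuming the training script above: round-trip the trained model through the files written by tokenizer.model.save(), append the missing merge, and rebuild the BPE model. The exact file layout and merge priorities can differ between versions, so treat this as illustrative rather than an official workflow.

import json
from tokenizers import models

# Dump the trained BPE model; for BPE this writes vocab.json and merges.txt
# and returns the paths of the files it wrote.
vocab_file, merges_file = tokenizer.model.save(".")

NL = "Ċ"  # byte-level representation of "\n"

with open(vocab_file, encoding="utf-8") as f:
    vocab = json.load(f)
with open(merges_file, encoding="utf-8") as f:
    merges = [line.rstrip("\n") for line in f
              if line.strip() and not line.startswith("#")]

# Force a merge rule for two consecutive newlines if training never produced one.
pair = f"{NL} {NL}"
if pair not in merges:
    merges.append(pair)                     # appended last, i.e. lowest priority
    vocab.setdefault(NL + NL, len(vocab))   # the merged token also needs an id

# Rebuild the model with the patched vocab/merges and plug it back into the tokenizer.
tokenizer.model = models.BPE(
    vocab=vocab,
    merges=[tuple(m.split(" ")) for m in merges],
)

print(tokenizer.encode("#include <set>\n\n\n\n\n").tokens)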

ArthurZucker avatar Jul 26 '24 10:07 ArthurZucker

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Aug 26 '24 01:08 github-actions[bot]