
How to allow the merging of consecutive newline tokens \n when training a byte-level BPE tokenizer?

Open liuslnlp opened this issue 1 year ago • 5 comments

Hello, I'm currently working on training a byte-level BPE tokenizer using the Hugging Face tokenizers library. I've put together a simple training script and a sample corpus, and included the output produced by the script. My aim is to understand why consecutive newline tokens \n are not being merged into a single token \n\n during tokenization. Below are the details:

from tokenizers import (
    Tokenizer,
    pre_tokenizers,
    models,
    decoders,
    trainers,
    processors,
)

files = ["demo_corpus.txt"]
tokenizer = Tokenizer(models.BPE())
# Split digits individually, then apply GPT-2 style byte-level pre-tokenization.
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Digits(individual_digits=True),
    pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=True)
])
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel()

# Seed the vocabulary with the full byte-level alphabet so every byte is covered.
trainer = trainers.BpeTrainer(
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    vocab_size=2000,
    special_tokens=[
        "<pad>", "<|beginoftext|>", "<|endoftext|>"
    ]
)
tokenizer.train(files, trainer)
test_text = "#include <set>\n\n\n\n\n"

print("pre-tokenize spans:", tokenizer.pre_tokenizer.pre_tokenize_str(test_text))
ids = tokenizer.encode(test_text).ids
print(f"tokens: {[tokenizer.decode([tid]) for tid in ids]}")

demo_corpus.txt:

#include <cstdio>

#include <vector>

#include <set>

using namespace std;

int main(){
    int N, A[100000], p = 0;

    multiset<int> S;

    scanf("%d", &N);

    int p0 = 0, q0 = 1, q = N-1;

    vector<int> result;

    for(int i: result)

        printf("%d\n", i);
}

output of the training script:

pre-tokenize spans: [('#', (0, 1)), ('include', (1, 8)), ('Ġ<', (8, 10)), ('set', (10, 13)), ('>', (13, 14)), ('ĊĊĊĊĊ', (14, 19))]
tokens: ['#', 'include', ' <', 'set', '>', '\n', '\n', '\n', '\n', '\n']

The following are the tokens produced by the Llama 3 tokenizer:

from transformers import LlamaTokenizerFast

tokenizer = LlamaTokenizerFast.from_pretrained("my llama3 vocab path")
test_text = "#include <set>\n\n\n\n\n"
print([tokenizer.decode([tid]) for tid in tokenizer(test_text)["input_ids"]])

# output
# ['<|begin_of_text|>', '#include', ' <', 'set', '>\n\n\n\n\n']
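
A quick way to see where the difference comes from is to check whether the freshly trained vocabulary contains any merged newline token at all. A minimal sketch, using the tokenizer object from the training script above ("Ċ" is the byte-level representation of "\n"):

# Sketch: check whether multi-newline tokens made it into the trained vocab.
vocab = tokenizer.get_vocab()
for tok in ["Ċ", "ĊĊ", "ĊĊĊ", "ĊĊĊĊ"]:
    print(repr(tok), "in vocab:", tok in vocab)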

liuslnlp avatar May 18 '24 03:05 liuslnlp

Hi, @Narsil @ArthurZucker I need some help.

liuslnlp avatar May 31 '24 12:05 liuslnlp

Possibly related: https://github.com/meta-llama/llama3/issues/227

josharian avatar Jun 03 '24 22:06 josharian

Hey! That is a good question, will answer in a bit.

ArthurZucker avatar Jun 11 '24 13:06 ArthurZucker

Oops, I think it depends on the content of demo_corpus.txt: if it mostly saw longer runs like \n\n\n\n, it might not have a merge rule for \n and \n alone, while it can have merge rules like \n\n \n and \n\n\n \n. Not super sure, but you can check the merges.txt and manually add the missing entries if you want to force them.
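
A rough sketch of that idea, assuming the training script above: round-trip the trained model through the files written by tokenizer.model.save(), append the missing merge, and rebuild the BPE model. The exact file layout and merge priorities can differ between versions, so treat this as illustrative rather than an official workflow.

import json
from tokenizers import models

# Dump the trained BPE model; for BPE this writes vocab.json and merges.txt
# and returns the paths of the files it wrote.
vocab_file, merges_file = tokenizer.model.save(".")

NL = "Ċ"  # byte-level representation of "\n"

with open(vocab_file, encoding="utf-8") as f:
    vocab = json.load(f)
with open(merges_file, encoding="utf-8") as f:
    merges = [line.rstrip("\n") for line in f
              if line.strip() and not line.startswith("#")]

# Force a merge rule for two consecutive newlines if training never produced one.
pair = f"{NL} {NL}"
if pair not in merges:
    merges.append(pair)                     # appended last, i.e. lowest priority
    vocab.setdefault(NL + NL, len(vocab))   # the merged token also needs an id

# Rebuild the model with the patched vocab/merges and plug it back into the tokenizer.
tokenizer.model = models.BPE(
    vocab=vocab,
    merges=[tuple(m.split(" ")) for m in merges],
)

print(tokenizer.encode("#include <set>\n\n\n\n\n").tokens)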

ArthurZucker avatar Jul 26 '24 10:07 ArthurZucker

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Aug 26 '24 01:08 github-actions[bot]