How to allow the merging of consecutive newline tokens \n when training a byte-level bpe tokenizer?
Hello, I'm currently training a byte-level BPE tokenizer with the Hugging Face tokenizers library. Below are a minimal training script, the sample corpus, and the output the script produces. My aim is to understand why consecutive newline tokens \n are not merged into a single token \n\n during tokenization. Here are the details:
from tokenizers import (
    Tokenizer,
    pre_tokenizers,
    models,
    decoders,
    trainers,
    processors,
)

files = ["demo_corpus.txt"]

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Digits(individual_digits=True),
    pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=True),
])
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel()

trainer = trainers.BpeTrainer(
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    vocab_size=2000,
    special_tokens=["<pad>", "<|beginoftext|>", "<|endoftext|>"],
)
tokenizer.train(files, trainer)

test_text = "#include <set>\n\n\n\n\n"
print("pre-tokenize spans:", tokenizer.pre_tokenizer.pre_tokenize_str(test_text))
ids = tokenizer.encode(test_text).ids
print(f"tokens: {[tokenizer.decode([tid]) for tid in ids]}")
demo_corpus.txt:
#include <cstdio>
#include <vector>
#include <set>
using namespace std;
int main(){
    int N, A[100000], p = 0;
    multiset<int> S;
    scanf("%d", &N);
    int p0 = 0, q0 = 1, q = N-1;
    vector<int> result;
    for(int i: result)
        printf("%d\n", i);
}
Output of the training script:
pre-tokenize spans: [('#', (0, 1)), ('include', (1, 8)), ('Ġ<', (8, 10)), ('set', (10, 13)), ('>', (13, 14)), ('ĊĊĊĊĊ', (14, 19))]
tokens: ['#', 'include', ' <', 'set', '>', '\n', '\n', '\n', '\n', '\n']
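Note that the pre-tokenizer keeps the five newlines together as a single span ('ĊĊĊĊĊ' is the byte-level form of '\n\n\n\n\n'), so the split back into single \n tokens must come from the BPE merges themselves. Here is a small sketch for checking whether any newline merge was learned; the file name is just illustrative:

import json

tokenizer.save("demo_tokenizer.json")
with open("demo_tokenizer.json", encoding="utf-8") as f:
    merges = json.load(f)["model"]["merges"]

# "Ċ" is the byte-level symbol for "\n"; an empty list here means the
# trainer never learned a rule that could glue two newlines together
print([m for m in merges if "Ċ" in str(m)])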
For comparison, here are the tokens produced by the llama3 tokenizer:

from transformers import LlamaTokenizerFast

tokenizer = LlamaTokenizerFast.from_pretrained("my llama3 vocab path")
test_text = "#include <set>\n\n\n\n\n"
print([tokenizer.decode([tid]) for tid in tokenizer(test_text)["input_ids"]])
# output
# ['<|begin_of_text|>', '#include', ' <', 'set', '>\n\n\n\n\n']
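A vocab lookup makes the difference concrete: llama3's vocabulary contains fused newline tokens, while the freshly trained one above stops at single newlines. A sketch reusing the llama3 tokenizer loaded above (again, 'Ċ' is the byte-level form of '\n'):

vocab = tokenizer.get_vocab()
print("ĊĊ" in vocab)      # "\n\n" as a single token; expected True for llama3
print(">ĊĊĊĊĊ" in vocab)  # the fused '>\n\n\n\n\n' token seen in the output above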
Hi @Narsil @ArthurZucker, I need some help.
Possibly related: https://github.com/meta-llama/llama3/issues/227
Hey! That is a good question, I will answer in a bit.
Oops, I think it depends on the content of demo_corpus.txt: if it saw a lot more instances of \n\n\n\n, it might not have a merge rule for \n and \n alone, while it can have \n\n + \n and \n\n\n + \n. Not super sure, but you can check the merges.txt and manually add the missing entries if you want to force them.
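Following that suggestion, here is a rough sketch of what manually adding the missing entry could look like, by patching a saved tokenizer.json. The file names are illustrative, and the on-disk merge format differs between library versions, so treat this as a starting point rather than a tested recipe:

import json

from tokenizers import Tokenizer

with open("demo_tokenizer.json", encoding="utf-8") as f:
    data = json.load(f)

vocab = data["model"]["vocab"]
merges = data["model"]["merges"]

# "Ċ" is the byte-level symbol for "\n": add the missing "\n" + "\n" rule
# and a vocab entry for the merged token, if they are not already present
if "ĊĊ" not in vocab:
    vocab["ĊĊ"] = max(vocab.values()) + 1
    # older tokenizer.json files store merges as "left right" strings,
    # newer ones as ["left", "right"] pairs; handle both
    merges.append("Ċ Ċ" if merges and isinstance(merges[0], str) else ["Ċ", "Ċ"])

with open("demo_tokenizer_patched.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)

patched = Tokenizer.from_file("demo_tokenizer_patched.json")
print(patched.encode("#include <set>\n\n\n\n\n").tokens)

Because the new rule is appended last, it gets the lowest merge priority, which should be fine here: no earlier rule touches 'Ċ', so the five newlines would come out as 'ĊĊ', 'ĊĊ', 'Ċ' instead of five single tokens.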