
[Tokenizers] WhiteSpace pretokenizer not only splitting on whitespace

Open luisquintanilla opened this issue 1 year ago • 2 comments

Given the following code:

using Microsoft.ML.Tokenizers;

ReadOnlySpan<char> sentence = "[CLS] I love AI [SEP]";

var pretokenizer = new WhiteSpace();

var tokens = pretokenizer.PreTokenize(sentence);

// Actual
Console.WriteLine("Actual:");
foreach(var result in tokens)
{
    var substr = sentence.Slice(result.Offset, result.Length);
    Console.WriteLine(substr.ToString());
}

Console.WriteLine("----");

// Expected
Console.WriteLine("Expected:");
foreach(var expected in sentence.ToString().Split(' '))
{
    Console.WriteLine(expected);
}

The WhiteSpace pretokenizer splits on non-alphanumeric characters in addition to whitespace, not on whitespace alone.

Output:

Actual:
[
CLS
]
I
love
AI
[
SEP
]
----
Expected:
[CLS]
I
love
AI
[SEP]

Based on the name, I would expect it to split only on whitespace.

luisquintanilla avatar Jun 26 '24 02:06 luisquintanilla

@luisquintanilla we match what Hugging Face does with its whitespace pre-tokenizer:

https://github.com/dotnet/machinelearning/blob/8e3f72d0239d74d5c3cd681e7027edc65c18f1a0/src/Microsoft.ML.Tokenizers/PreTokenizer/Whitespace.cs#L22 https://github.com/huggingface/tokenizers/blob/fdd26ba9a3f0c133427aab0423888cbde91362d7/tokenizers/src/pre_tokenizers/whitespace.rs#L21
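Both linked implementations split with a regex along the lines of `\w+|[^\w\s]+` (runs of word characters, or runs of other non-whitespace characters). A minimal Python sketch of that rule, reproducing the "Actual" output above:

```python
import re

# Pattern equivalent to the rule used by both whitespace pre-tokenizers
# linked above: match runs of word characters, or runs of punctuation.
PATTERN = r"\w+|[^\w\s]+"

sentence = "[CLS] I love AI [SEP]"
print(re.findall(PATTERN, sentence))
# → ['[', 'CLS', ']', 'I', 'love', 'AI', '[', 'SEP', ']']
```

The brackets are punctuation, so `[CLS]` is split into three pieces by design.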

Do you still want to change that?

tarekgh avatar Jun 26 '24 16:06 tarekgh


I see. Thanks for clarifying. If we're consistent I'd leave it. I know you're working on adding special tokens.

Is that happening before or after pretokenization?

I think that's where discrepancies may lie.

When using the WhiteSpace pretokenizer, special tokens such as [CLS] get broken up, as shown in the code example above. I'd think that [CLS] should be treated as a single token, even after pretokenization.
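For comparison, the whitespace-only behavior described here (which Hugging Face exposes as a separate WhitespaceSplit pre-tokenizer) splits on runs of whitespace and keeps the bracketed tokens whole. A minimal Python sketch:

```python
import re

sentence = "[CLS] I love AI [SEP]"

# Split only on whitespace: take maximal runs of non-whitespace characters.
# Bracketed special tokens survive intact under this rule.
print(re.findall(r"\S+", sentence))
# → ['[CLS]', 'I', 'love', 'AI', '[SEP]']
```

Note this only illustrates the splitting rule; in practice special tokens are usually recognized before pretokenization rather than by the pretokenizer itself.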

luisquintanilla avatar Jul 10 '24 19:07 luisquintanilla

@luisquintanilla it seems like we are consistent with Hugging Face here: [screenshot comparing the two outputs omitted]

Were you noticing a problem that you thought was caused by this (in which case we should try to diagnose that) or were you just calling the API directly and expecting it to behave differently?

ericstj avatar Aug 05 '24 19:08 ericstj

Thanks for investigating.

I was under the impression that [CLS] would be treated differently since it's a special token.

I think we can close this one since we're consistent.

luisquintanilla avatar Aug 05 '24 20:08 luisquintanilla