
Update Tokenizer to use Microsoft.ML.Tokenizers library

Open luisquintanilla opened this issue 1 year ago • 6 comments

The existing tokenizer implementation supports only GPT models. The Microsoft.ML.Tokenizers package provides a BPE tokenizer implementation that can be used with GPT models. Beyond that, you can also load your own vocabulary files for use with other models that support BPE tokenization.

Here is a related sample that loads custom vocab files for GPT-2 from HuggingFace.

https://gist.github.com/luisquintanilla/bc91de8668cfa7c3755b20329fadd027

API Documentation
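The gist's approach can be sketched roughly as follows. This is an editor's illustration, not code from the gist: the `Bpe` constructor arguments and the shape of the `Encode` result are taken from the preview Microsoft.ML.Tokenizers package available at the time and may differ in later releases.

```csharp
// Sketch only: load GPT-2 vocab/merges files from Hugging Face into the
// Microsoft.ML.Tokenizers BPE model, then encode a string.
using System;
using Microsoft.ML.Tokenizers;

class Program
{
    static void Main()
    {
        // vocab.json and merges.txt downloaded from https://huggingface.co/gpt2
        var tokenizer = new Tokenizer(new Bpe("vocab.json", "merges.txt"));

        var encoding = tokenizer.Encode("Hello, world!");

        // Token IDs the model would consume
        Console.WriteLine(string.Join(", ", encoding.Ids));
    }
}
```

The point of the sample is that the same `Tokenizer` type works for any model whose vocabulary is expressed as BPE vocab/merges files, not just GPT.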

luisquintanilla avatar Apr 17 '23 15:04 luisquintanilla

@luisquintanilla, thanks for the suggestion, we will take a look.

evchaki avatar Apr 18 '23 20:04 evchaki

@luisquintanilla we're doing some work to integrate tiktoken. Does ML.Tokenizers include additional tokenizers that are not part of tiktoken?

dluc avatar Apr 19 '23 07:04 dluc

> @luisquintanilla we're doing some work to integrate tiktoken, does ML.Tokenizer include additional tokenizers, not part of tiktoken?

@dluc Currently ML.Tokenizers supports only BPE which I think is also the only one tiktoken supports.

@tarekgh can confirm which tokenizers are supported.

luisquintanilla avatar Apr 19 '23 13:04 luisquintanilla

Right, ML.Tokenizers supports BPE, which tiktoken also supports according to https://github.com/openai/tiktoken. ML.Tokenizers supports EnglishRoberta too, but that one is not supported by tiktoken.

tarekgh avatar Apr 19 '23 15:04 tarekgh

Thanks for the info; this work is in progress.

dluc avatar May 08 '23 04:05 dluc

https://github.com/microsoft/Tokenizer

This repo contains C# and TypeScript implementations of a byte pair encoding (BPE) tokenizer for OpenAI LLMs, based on the open-sourced Rust implementation in OpenAI's tiktoken. Both implementations are useful for running prompt tokenization in .NET and Node.js environments before feeding a prompt into an LLM.
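For context, usage of that library looks roughly like the sketch below. This is an assumption based on that repo's README at the time (the `Microsoft.DeepDev.TokenizerLib` package); the method names and the `allowedSpecial` parameter may have changed since.

```csharp
// Sketch: tokenize a prompt with the microsoft/Tokenizer C# implementation
// before sending it to an OpenAI model. API per the repo README; unverified.
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.DeepDev;

class Program
{
    static async Task Main()
    {
        // Resolves the correct BPE encoding for the named model
        var tokenizer = await TokenizerBuilder.CreateByModelNameAsync("gpt-3.5-turbo");

        // Second argument: special tokens allowed to appear in the input
        var ids = tokenizer.Encode("Hello, world!", new HashSet<string>());

        Console.WriteLine($"{ids.Count} tokens");
    }
}
```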

JadynWong avatar May 17 '23 12:05 JadynWong

#2809 and #2147 made it easier to bring your own tokenizer. #2840 demonstrates these capabilities specifically with the MicrosoftML and DeepDev token counters, alongside the existing demonstration with SharpToken.
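The bring-your-own-tokenizer pattern those PRs enable can be sketched like this: `TextChunker` accepts a token-counting delegate, so any tokenizer library can supply the count. The SharpToken calls below follow that library's README and should be treated as a sketch rather than the exact sample code from #2840.

```csharp
// Sketch: chunk text by token count using a custom token counter.
// SharpToken supplies the counting; TextChunker does the splitting.
using System;
using System.Collections.Generic;
using Microsoft.SemanticKernel.Text;
using SharpToken;

class Program
{
    static void Main()
    {
        string text = "Semantic Kernel chunks text by token count, not by characters.";

        // cl100k_base is the encoding used by gpt-3.5-turbo / gpt-4
        var encoding = GptEncoding.GetEncoding("cl100k_base");

        List<string> lines = TextChunker.SplitPlainTextLines(
            text,
            maxTokensPerLine: 16,
            tokenCounter: input => encoding.Encode(input).Count);

        foreach (var line in lines)
        {
            Console.WriteLine(line);
        }
    }
}
```

Swapping in a different library only means changing the lambda passed as `tokenCounter`.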

lemillermicrosoft avatar Sep 22 '23 16:09 lemillermicrosoft

Thanks for these changes @lemillermicrosoft. They look great. Feel free to close this when the PR merges.

luisquintanilla avatar Sep 22 '23 16:09 luisquintanilla

Fantastic Utility, Thank You!

What class supports BERT-style tokenizers, along the lines of FastBertTokenizer and BertTokenizer?

```csharp
public sealed class Bpe : Model
public sealed class EnglishRoberta : Model
public sealed class Tiktoken : Model
```

These three classes seem to be the only ones that implement the Model class.

The EnglishRoberta class does not give the same results as the previously mentioned classes for the given vocab files.

It would be fantastic to have one class covering all the main models.

Thank You!

KokinSok avatar Feb 22 '24 20:02 KokinSok

Is there any sort of cross-team effort (between https://github.com/dotnet/machinelearning and https://github.com/microsoft/semantic-kernel) to get the .NET version of Semantic Kernel using only the ML.NET tokenizers? This seems like it was a first step, so I'm trying to track the progress of further work. I see that the .NET Semantic Kernel uses four different tokenizer libraries (from microsoft/semantic-kernel/dotnet/Directory.Packages.props): (screenshot of the package references)

KashMoneyMillionaire avatar Apr 04 '24 21:04 KashMoneyMillionaire

> I see that dotnet semantic kernel uses 4 different tokenizer libraries (from microsoft/semantic-kernel/dotnet/Directory.Packages.props)

That's just in the samples. The actual SK libraries don't use most of them (the ONNX library does use FastBertTokenizer, but there's a separate issue tracking adding a BERT tokenizer to the ML.NET tokenizer lib).

stephentoub avatar Apr 04 '24 21:04 stephentoub

Ahh, that makes sense. I saw SharpToken was only used in samples, but FastBert was used in the code, so I stopped there and didn't look at DeepDev. That was my second question, though: do the samples need different tokenizers, or should we be pushing them toward the ML.NET tokenizer libraries? Or should we remove those libraries from dotnet/Directory.Packages.props and reference them directly in dotnet/samples/KernelSyntaxExamples/KernelSyntaxExamples.csproj?

KashMoneyMillionaire avatar Apr 04 '24 21:04 KashMoneyMillionaire

Our collective goal is that all relevant tokenizers end up in Microsoft.ML.Tokenizers so that it's a one-stop shop for tokenization needs in .NET. Microsoft.ML.Tokenizers now provides a Tiktoken implementation, so SK wouldn't need DeepDev or SharpToken, but it doesn't yet have a BERT tokenizer. Different models need different tokenizers; tiktoken is what's used by OpenAI's models. At this point we could probably refine the sample that uses DeepDev and SharpToken to drop them, but all that sample really shows is that you can use different libraries within the delegate passed to TextChunker.
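Consolidating on Microsoft.ML.Tokenizers would then look roughly like the sketch below. The factory-method name is an assumption from a preview build of the package (later versions expose it as `TiktokenTokenizer.CreateForModel`), so treat this as an illustration of the direction rather than the shipped API.

```csharp
// Sketch: Microsoft.ML.Tokenizers' Tiktoken support standing in for
// DeepDev/SharpToken inside the TextChunker token-counting delegate.
using System;
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;
using Microsoft.SemanticKernel.Text;

class Program
{
    static void Main()
    {
        // Factory method name from a preview build; may differ when shipped
        Tokenizer tokenizer = Tokenizer.CreateTiktokenForModel("gpt-4");

        var lines = new List<string>
        {
            "First paragraph of source text.",
            "Second paragraph of source text."
        };

        List<string> paragraphs = TextChunker.SplitPlainTextParagraphs(
            lines,
            maxTokensPerParagraph: 512,
            tokenCounter: input => tokenizer.CountTokens(input));

        Console.WriteLine($"{paragraphs.Count} paragraphs");
    }
}
```

Only the one `using` and the tokenizer construction change; the `TextChunker` call stays the same, which is the point of the delegate-based design.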

stephentoub avatar Apr 04 '24 21:04 stephentoub

Love it, that makes a lot of sense. I've been diving into these two repos and roadmaps and plans and docs over the last few days, and I got the sense that was the goal, but was difficult to find a definitive place essentially saying that. I'm assuming the ML.NET Roadmap is out of date, but that the ML.NET Milestones are up to date. Any pointers on the best place to find info similar to that for SK?

Thanks for the time clarifying!

KashMoneyMillionaire avatar Apr 04 '24 22:04 KashMoneyMillionaire