Improve Microsoft.ML.Tokenizers and drive complete 1.0 API

Open ericstj opened this issue 1 year ago • 0 comments

Goal: Enable .NET developers to use tokenizers in their data pre-processing pipelines as part of their embedding and token generation tasks using language models.

Committed:

- [ ] Add support for more commonly used tokenizers
  - [x] TikToken https://github.com/dotnet/machinelearning/pull/6981
  - [x] LlamaTokenizer & SentencePiece algorithm https://github.com/dotnet/machinelearning/issues/6987
  - [x] CodeGenTokenizer & Byte-level BPE https://github.com/dotnet/machinelearning/issues/6992
  - [x] WordPiece algorithm https://github.com/dotnet/machinelearning/issues/6988
  - [x] BERTTokenizer https://github.com/dotnet/machinelearning/issues/6991
- [x] Measure and improve performance of the Tokenizers API, making breaking changes where necessary (https://github.com/dotnet/machinelearning/issues/6982)
- [ ] Explore existing construction patterns to improve usability, both in the factory API and in loading from configuration
- [x] Drive adoption of Microsoft.ML.Tokenizers in other libraries
- [ ] Docs and samples

Backlog:

- [ ] Investigate using Microsoft.ML.Tokenizers in the Azure OpenAI SDK
- [ ] SentencePiece Unigram https://github.com/dotnet/machinelearning/issues/7186
- [ ] CLIP Tokenizer https://github.com/dotnet/machinelearning/issues/6993

Feb 02 '24 19:02 ericstj
