Improve Microsoft.ML.Tokenizers and drive complete 1.0 API
Goal: Enable .NET developers to use tokenizers in their data pre-processing pipelines as part of their embedding and token generation tasks using language models.
Committed:
- [ ] Add support for more commonly used tokenizers
  - [x] TikToken https://github.com/dotnet/machinelearning/pull/6981
  - [x] LlamaTokenizer & SentencePiece algorithm https://github.com/dotnet/machinelearning/issues/6987
  - [x] CodeGenTokenizer & Byte-level BPE https://github.com/dotnet/machinelearning/issues/6992
  - [x] WordPiece algorithm https://github.com/dotnet/machinelearning/issues/6988
  - [x] BERTTokenizer https://github.com/dotnet/machinelearning/issues/6991
- [x] Measure and improve performance of the Tokenizers API, making breaking changes where necessary (https://github.com/dotnet/machinelearning/issues/6982)
- [ ] Explore existing construction patterns to improve usability, both in the factory API and when loading from configuration
- [x] Drive adoption of Microsoft.ML.Tokenizers in other libraries
- [ ] Docs and samples
Backlog:
- [ ] Investigate using Microsoft.ML.Tokenizers in Azure OpenAI SDK
- [ ] SentencePiece Unigram https://github.com/dotnet/machinelearning/issues/7186
- [ ] CLIP Tokenizer https://github.com/dotnet/machinelearning/issues/6993