
Improve Microsoft.ML.Tokenizers and drive complete 1.0 API

Open ericstj opened this issue 1 year ago • 0 comments

Goal: Enable .NET developers to use tokenizers in their data pre-processing pipelines as part of their embedding and token generation tasks using language models.

Committed:

  • [ ] Add support for more commonly used tokenizers
    • [x] TikToken https://github.com/dotnet/machinelearning/pull/6981
    • [x] LlamaTokenizer & SentencePiece algorithm https://github.com/dotnet/machinelearning/issues/6987
    • [x] CodeGenTokenizer & Byte-level BPE https://github.com/dotnet/machinelearning/issues/6992
    • [x] WordPiece algorithm https://github.com/dotnet/machinelearning/issues/6988
    • [x] BERTTokenizer https://github.com/dotnet/machinelearning/issues/6991
  • [x] Measure and improve performance of Tokenizers API - making breaking changes where necessary. (https://github.com/dotnet/machinelearning/issues/6982)
  • [ ] Explore existing construction patterns to improve usability, both in the factory API and when loading from configuration.
  • [x] Drive adoption of Microsoft.ML.Tokenizers in other libraries
  • [ ] Docs and samples

Backlog:

  • [ ] Investigate using Microsoft.ML.Tokenizers in Azure OpenAI SDK
  • [ ] SentencePiece Unigram https://github.com/dotnet/machinelearning/issues/7186
  • [ ] CLIP Tokenizer https://github.com/dotnet/machinelearning/issues/6993

ericstj, Feb 02 '24 19:02