Eric StJohn
## Build Information Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=419700 Build error leg or test failing: Microsoft.ML.Fairlearn.Tests.GridSearchTest.TestGridSearchTrialRunner2 Pull request: https://github.com/dotnet/machinelearning/pull/6837 ## Error Message ``` Exception Message System.AggregateException : One or more errors occurred. (Unable to...
## Build Information Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=419700 Build error leg or test failing: Microsoft.ML.Fairlearn.Tests.WorkItemExecution Pull request: https://github.com/dotnet/machinelearning/pull/6837 ## Error Message ``` [Source=AutoMLExperiment, Kind=Trace] Channel started [Source=AutoMLExperiment, Kind=Info] Update Running Trial - Id:...
Originally I was just trying to remove mentions of snupkg, but then things got a bit carried away. :) This is trying to remove as much duplication and dead code...
**System Information (please complete the following information):**
- OS & Version: Windows 11 22H2 x64
- ML.NET Version: https://github.com/dotnet/machinelearning/pull/6703
- .NET Version: .NET 8

**Describe the bug**
When updating OnnxRuntime...
**System Information (please complete the following information):**
- OS & Version: Windows 11
- ML.NET Version: latest
- .NET Version: .NET...
Port the CLIP tokenizer, which uses byte-level BPE. This tokenizer enables scenarios like Stable Diffusion. May depend on https://github.com/dotnet/machinelearning/issues/6992.

Reference:
- https://huggingface.co/docs/transformers/main/en/model_doc/clip
- https://github.com/huggingface/transformers/blob/0549000c5bf6c7249f411917f2a6f0b6d0f06da1/src/transformers/models/codegen/tokenization_codegen.py#L98
- https://onnxruntime.ai/docs/tutorials/csharp/stable-diffusion-csharp.html#tokenization-with-onnx-runtime-extensions

Paper:
- https://arxiv.org/abs/2103.00020
- https://arxiv.org/pdf/2103.00020.pdf
Port the CodeGen tokenizer to enable Phi-2 models.

Reference:
- https://huggingface.co/docs/transformers/main/en/model_doc/codegen
- https://github.com/huggingface/transformers/blob/0549000c5bf6c7249f411917f2a6f0b6d0f06da1/src/transformers/models/codegen/tokenization_codegen.py#L98

Paper:
- https://arxiv.org/abs/2203.13474
- https://arxiv.org/pdf/2203.13474.pdf
The SentencePiece algorithm should be added to Microsoft.ML.Tokenizers. It is a dependency of LLaMATokenizer, which we also wish to enable. Reference implementations are available in https://github.com/microsoft/BlingFire (MIT license) https://github.com/google/sentencepiece...
**Goal**: Enable .NET developers to use tokenizers in their data pre-processing pipelines as part of their embedding and token generation tasks using language models. Committed: - [ ] Add support...
Porting BERTTokenizers enables several text embedding generation models. Requires https://github.com/dotnet/machinelearning/issues/6988.

Reference:
- https://github.com/huggingface/text-embeddings-inference?tab=readme-ov-file#text-embeddings
- https://github.com/huggingface/transformers/blob/v4.37.0/src/transformers/models/bert/tokenization_bert.py#L137

cc @luisquintanilla

We already have a BERT implementation that may be sufficient.