machinelearning
ML.NET is an open source and cross-platform machine learning framework for .NET.
**System Information:** - Windows 10 - ML.NET 3.0.1 - .NET 4.7.2 **Describe the bug** I am unable to use this library when my project is built with .NET 4.7.2. The...
In ML.NET we currently only have the [KMeansTrainer](https://docs.microsoft.com/en-us/dotnet/machine-learning/resources/tasks#clustering). The main challenge with that clustering trainer is that you need to provide the number of clusters to use (numberOfClusters param also...
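For context, a minimal sketch of how the trainer is called today; the "Features" column name and the cluster count of 3 are illustrative assumptions, not recommendations:

```csharp
using System;
using Microsoft.ML;

// Minimal sketch of current KMeansTrainer usage; the "Features" column name
// and the cluster count of 3 are illustrative assumptions.
public static class KMeansSketch
{
    public static void Train(MLContext mlContext, IDataView data)
    {
        // Today the caller must decide numberOfClusters up front;
        // there is no built-in mechanism to infer it from the data.
        var trainer = mlContext.Clustering.Trainers.KMeans(
            featureColumnName: "Features",
            numberOfClusters: 3);

        var model = trainer.Fit(data);
        var predictions = model.Transform(data);

        var metrics = mlContext.Clustering.Evaluate(
            predictions, featureColumnName: "Features");
        Console.WriteLine($"Average distance to centroid: {metrics.AverageDistance}");
    }
}
```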
The implementation at https://github.com/openai/tiktoken/commits/main/src/lib.rs has seen several improvements in the last year (e.g. https://github.com/openai/tiktoken/pull/255), including a couple that claim perf wins around algorithmic complexity for long inputs. The comments in...
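Not the tiktoken implementation itself, but a naive greedy BPE merge loop in C# that illustrates why the algorithmic complexity matters for long inputs: each merge re-scans the whole token sequence, so cost degenerates toward quadratic as inputs grow. The `ranks` dictionary is a hypothetical stand-in for a real merges table:

```csharp
using System.Collections.Generic;

// Naive greedy BPE merging, for illustration only (not the tiktoken algorithm).
// `ranks` maps a merged pair (left + right) to its merge priority; lower wins.
public static class NaiveBpe
{
    public static List<string> Merge(string word, IReadOnlyDictionary<string, int> ranks)
    {
        var tokens = new List<string>();
        foreach (var c in word) tokens.Add(c.ToString());

        while (tokens.Count > 1)
        {
            // Scan every adjacent pair on every iteration: O(n) per merge,
            // O(n^2) overall in the worst case, which is what hurts long inputs.
            int bestIndex = -1, bestRank = int.MaxValue;
            for (int i = 0; i < tokens.Count - 1; i++)
            {
                if (ranks.TryGetValue(tokens[i] + tokens[i + 1], out int rank) && rank < bestRank)
                {
                    bestRank = rank;
                    bestIndex = i;
                }
            }

            if (bestIndex < 0) break; // no applicable merge remains

            tokens[bestIndex] = tokens[bestIndex] + tokens[bestIndex + 1];
            tokens.RemoveAt(bestIndex + 1);
        }

        return tokens;
    }
}
```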
In particular, key metrics such as: 1) Mean Absolute Error (MAE), 2) Mean Squared Error (MSE), 3) Root Mean Squared Error (RMSE), etc.
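For reference, these metrics are already surfaced by ML.NET's regression evaluator; a minimal sketch, assuming a scored IDataView with the default "Label" and "Score" columns:

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Sketch: reading MAE / MSE / RMSE from ML.NET's regression evaluator.
// Assumes `predictions` is a scored IDataView with "Label" and "Score" columns.
public static class RegressionMetricsSketch
{
    public static void Report(MLContext mlContext, IDataView predictions)
    {
        RegressionMetrics metrics = mlContext.Regression.Evaluate(
            predictions, labelColumnName: "Label", scoreColumnName: "Score");

        Console.WriteLine($"MAE:  {metrics.MeanAbsoluteError:F4}");
        Console.WriteLine($"MSE:  {metrics.MeanSquaredError:F4}");
        Console.WriteLine($"RMSE: {metrics.RootMeanSquaredError:F4}");
    }
}
```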
**System Information:** - OS & Version: Windows 11 - ML.NET Version: latest - .NET Version: .NET...
Port the CLIP tokenizer, which leverages byte-level BPE. This tokenizer enables scenarios like StableDiffusion. May be dependent on https://github.com/dotnet/machinelearning/issues/6992. Reference: https://huggingface.co/docs/transformers/main/en/model_doc/clip https://github.com/huggingface/transformers/blob/0549000c5bf6c7249f411917f2a6f0b6d0f06da1/src/transformers/models/codegen/tokenization_codegen.py#L98 https://onnxruntime.ai/docs/tutorials/csharp/stable-diffusion-csharp.html#tokenization-with-onnx-runtime-extensions Paper: https://arxiv.org/abs/2103.00020 https://arxiv.org/pdf/2103.00020.pdf
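As a rough illustration of the byte-level part: GPT-2-style byte-level BPE (which the CLIP tokenizer builds on) first maps every byte to a printable unicode character so that merges operate on visible symbols. A sketch mirroring the bytes_to_unicode helper in the referenced Hugging Face code; illustrative only, not the ported implementation:

```csharp
using System.Collections.Generic;
using System.Linq;

// Sketch of the GPT-2-style bytes_to_unicode mapping used by byte-level BPE
// tokenizers such as CLIP's; illustrative only, not the ported implementation.
public static class ByteLevelMapping
{
    public static Dictionary<byte, char> BytesToUnicode()
    {
        // Bytes that are already printable keep their own code point.
        var bytes = Enumerable.Range('!', '~' - '!' + 1)
            .Concat(Enumerable.Range(0xA1, 0xAC - 0xA1 + 1))
            .Concat(Enumerable.Range(0xAE, 0xFF - 0xAE + 1))
            .ToList();
        var chars = new List<int>(bytes);

        // Remaining bytes are shifted into the 256+ range to stay printable.
        int n = 0;
        for (int b = 0; b < 256; b++)
        {
            if (!bytes.Contains(b))
            {
                bytes.Add(b);
                chars.Add(256 + n);
                n++;
            }
        }

        return bytes.Zip(chars, (b, c) => (b, c))
                    .ToDictionary(p => (byte)p.b, p => (char)p.c);
    }
}
```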
Port the Codegen Tokenizer to enable Phi-2 models. Reference: https://huggingface.co/docs/transformers/main/en/model_doc/codegen https://github.com/huggingface/transformers/blob/0549000c5bf6c7249f411917f2a6f0b6d0f06da1/src/transformers/models/codegen/tokenization_codegen.py#L98 Paper: https://arxiv.org/abs/2203.13474 https://arxiv.org/pdf/2203.13474.pdf
The SentencePiece algorithm should be added to Microsoft.ML.Tokenizers. This is a dependency of the LLaMATokenizer, which we also wish to enable. We can see reference implementations in https://github.com/microsoft/BlingFire (MIT license) https://github.com/google/sentencepiece...
**Goal**: Enable .NET developers to use tokenizers in their data pre-processing pipelines as part of their embedding and token generation tasks using language models. Committed: - [ ] Add support...
Porting BERTTokenizers enables several text embedding generation models. Requires https://github.com/dotnet/machinelearning/issues/6988. https://github.com/huggingface/text-embeddings-inference?tab=readme-ov-file#text-embeddings. https://github.com/huggingface/transformers/blob/v4.37.0/src/transformers/models/bert/tokenization_bert.py#L137 cc @luisquintanilla. We already have a BERT implementation, which may be sufficient.
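For context, the core of a BERT-style tokenizer is WordPiece's greedy longest-match-first lookup over a fixed vocabulary. A minimal sketch, assuming whitespace pre-tokenization has already happened and using the conventional "##" continuation prefix and "[UNK]" token:

```csharp
using System.Collections.Generic;

// Minimal WordPiece sketch (greedy longest-match-first), illustrative only.
// Assumes the input is a single whitespace-separated word and that `vocab`
// uses the conventional "##" prefix for word-internal subwords.
public static class WordPieceSketch
{
    public static List<string> Tokenize(string word, ISet<string> vocab, string unkToken = "[UNK]")
    {
        var output = new List<string>();
        int start = 0;

        while (start < word.Length)
        {
            int end = word.Length;
            string current = null;

            // Try the longest substring first, shrinking until a vocab hit.
            while (start < end)
            {
                var piece = word.Substring(start, end - start);
                if (start > 0) piece = "##" + piece;
                if (vocab.Contains(piece)) { current = piece; break; }
                end--;
            }

            // As in BERT's reference tokenizer, an unmatchable word maps to [UNK].
            if (current == null) return new List<string> { unkToken };

            output.Add(current);
            start = end;
        }

        return output;
    }
}
```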