machinelearning
ML.NET is an open source and cross-platform machine learning framework for .NET.
**System Information:** - Windows 10 - ML.NET 3.0.1 - .NET 4.7.2 **Describe the bug** I am unable to use this library when my project is built with .NET 4.7.2. The...
In ML.NET we currently only have the [KMeansTrainer](https://docs.microsoft.com/en-us/dotnet/machine-learning/resources/tasks#clustering). The main challenge with that clustering trainer is that you need to provide the number of clusters to use (numberOfClusters param also...
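For context, a minimal sketch of how the trainer is called today; the "Features" column name and the cluster count of 3 are illustrative assumptions, not recommendations:

```csharp
using System;
using Microsoft.ML;

// Minimal sketch of current KMeansTrainer usage; the "Features" column name
// and the cluster count of 3 are illustrative assumptions.
public static class KMeansSketch
{
    public static void Train(MLContext mlContext, IDataView data)
    {
        // Today the caller must decide numberOfClusters up front;
        // there is no built-in mechanism to infer it from the data.
        var trainer = mlContext.Clustering.Trainers.KMeans(
            featureColumnName: "Features",
            numberOfClusters: 3);

        var model = trainer.Fit(data);
        var predictions = model.Transform(data);

        var metrics = mlContext.Clustering.Evaluate(
            predictions, featureColumnName: "Features");
        Console.WriteLine($"Average distance to centroid: {metrics.AverageDistance}");
    }
}
```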
The implementation at https://github.com/openai/tiktoken/commits/main/src/lib.rs has seen several improvements in the last year (e.g. https://github.com/openai/tiktoken/pull/255), including a couple that claim perf wins around algorithmic complexity for long inputs. The comments in...
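Not the tiktoken implementation itself, but a naive greedy BPE merge loop in C# that illustrates why the algorithmic complexity matters for long inputs: each merge re-scans the whole token sequence, so cost degenerates toward quadratic as inputs grow. The `ranks` dictionary is a hypothetical stand-in for a real merges table:

```csharp
using System.Collections.Generic;

// Naive greedy BPE merging, for illustration only (not the tiktoken algorithm).
// `ranks` maps a merged pair (left + right) to its merge priority; lower wins.
public static class NaiveBpe
{
    public static List<string> Merge(string word, IReadOnlyDictionary<string, int> ranks)
    {
        var tokens = new List<string>();
        foreach (var c in word) tokens.Add(c.ToString());

        while (tokens.Count > 1)
        {
            // Scan every adjacent pair on every iteration: O(n) per merge,
            // O(n^2) overall in the worst case, which is what hurts long inputs.
            int bestIndex = -1, bestRank = int.MaxValue;
            for (int i = 0; i < tokens.Count - 1; i++)
            {
                if (ranks.TryGetValue(tokens[i] + tokens[i + 1], out int rank) && rank < bestRank)
                {
                    bestRank = rank;
                    bestIndex = i;
                }
            }

            if (bestIndex < 0) break; // no applicable merge remains

            tokens[bestIndex] = tokens[bestIndex] + tokens[bestIndex + 1];
            tokens.RemoveAt(bestIndex + 1);
        }

        return tokens;
    }
}
```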
In particular, key metrics such as: 1) Mean Absolute Error (MAE), 2) Mean Squared Error (MSE), 3) Root Mean Squared Error (RMSE), etc.
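For reference, these metrics are already surfaced by ML.NET's regression evaluator; a minimal sketch, assuming a scored IDataView with the default "Label" and "Score" columns:

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Sketch: reading MAE / MSE / RMSE from ML.NET's regression evaluator.
// Assumes `predictions` is a scored IDataView with "Label" and "Score" columns.
public static class RegressionMetricsSketch
{
    public static void Report(MLContext mlContext, IDataView predictions)
    {
        RegressionMetrics metrics = mlContext.Regression.Evaluate(
            predictions, labelColumnName: "Label", scoreColumnName: "Score");

        Console.WriteLine($"MAE:  {metrics.MeanAbsoluteError:F4}");
        Console.WriteLine($"MSE:  {metrics.MeanSquaredError:F4}");
        Console.WriteLine($"RMSE: {metrics.RootMeanSquaredError:F4}");
    }
}
```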
**System Information:** - OS & Version: Windows 11 - ML.NET Version: latest - .NET Version: .NET...
Port the CLIP tokenizer, which leverages byte-level BPE. This tokenizer enables scenarios like StableDiffusion. May be dependent on https://github.com/dotnet/machinelearning/issues/6992. Reference: https://huggingface.co/docs/transformers/main/en/model_doc/clip https://github.com/huggingface/transformers/blob/0549000c5bf6c7249f411917f2a6f0b6d0f06da1/src/transformers/models/codegen/tokenization_codegen.py#L98 https://onnxruntime.ai/docs/tutorials/csharp/stable-diffusion-csharp.html#tokenization-with-onnx-runtime-extensions Paper: https://arxiv.org/abs/2103.00020 https://arxiv.org/pdf/2103.00020.pdf
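As a rough illustration of the byte-level part: GPT-2-style byte-level BPE (which the CLIP tokenizer builds on) first maps every byte to a printable unicode character so that merges operate on visible symbols. A sketch mirroring the bytes_to_unicode helper in the referenced Hugging Face code; illustrative only, not the ported implementation:

```csharp
using System.Collections.Generic;
using System.Linq;

// Sketch of the GPT-2-style bytes_to_unicode mapping used by byte-level BPE
// tokenizers such as CLIP's; illustrative only, not the ported implementation.
public static class ByteLevelMapping
{
    public static Dictionary<byte, char> BytesToUnicode()
    {
        // Bytes that are already printable keep their own code point.
        var bytes = Enumerable.Range('!', '~' - '!' + 1)
            .Concat(Enumerable.Range(0xA1, 0xAC - 0xA1 + 1))
            .Concat(Enumerable.Range(0xAE, 0xFF - 0xAE + 1))
            .ToList();
        var chars = new List<int>(bytes);

        // Remaining bytes are shifted into the 256+ range to stay printable.
        int n = 0;
        for (int b = 0; b < 256; b++)
        {
            if (!bytes.Contains(b))
            {
                bytes.Add(b);
                chars.Add(256 + n);
                n++;
            }
        }

        return bytes.Zip(chars, (b, c) => (b, c))
                    .ToDictionary(p => (byte)p.b, p => (char)p.c);
    }
}
```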
Port the Codegen Tokenizer to enable Phi-2 models. Reference: https://huggingface.co/docs/transformers/main/en/model_doc/codegen https://github.com/huggingface/transformers/blob/0549000c5bf6c7249f411917f2a6f0b6d0f06da1/src/transformers/models/codegen/tokenization_codegen.py#L98 Paper: https://arxiv.org/abs/2203.13474 https://arxiv.org/pdf/2203.13474.pdf
The SentencePiece algorithm should be added to Microsoft.ML.Tokenizers. This is a dependency of the LLaMATokenizer, which we also wish to enable. We can see reference implementations in https://github.com/microsoft/BlingFire (MIT license) https://github.com/google/sentencepiece...
**Goal**: Enable .NET developers to use tokenizers in their data pre-processing pipelines as part of their embedding and token generation tasks using language models. Committed: - [ ] Add support...
Porting BERTTokenizers enables several text embedding generation models. Requires https://github.com/dotnet/machinelearning/issues/6988. https://github.com/huggingface/text-embeddings-inference?tab=readme-ov-file#text-embeddings. https://github.com/huggingface/transformers/blob/v4.37.0/src/transformers/models/bert/tokenization_bert.py#L137 cc @luisquintanilla. We already have a BERT implementation, which may be sufficient.
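For context, the core of a BERT-style tokenizer is WordPiece's greedy longest-match-first lookup over a fixed vocabulary. A minimal sketch, assuming whitespace pre-tokenization has already happened and using the conventional "##" continuation prefix and "[UNK]" token:

```csharp
using System.Collections.Generic;

// Minimal WordPiece sketch (greedy longest-match-first), illustrative only.
// Assumes the input is a single whitespace-separated word and that `vocab`
// uses the conventional "##" prefix for word-internal subwords.
public static class WordPieceSketch
{
    public static List<string> Tokenize(string word, ISet<string> vocab, string unkToken = "[UNK]")
    {
        var output = new List<string>();
        int start = 0;

        while (start < word.Length)
        {
            int end = word.Length;
            string current = null;

            // Try the longest substring first, shrinking until a vocab hit.
            while (start < end)
            {
                var piece = word.Substring(start, end - start);
                if (start > 0) piece = "##" + piece;
                if (vocab.Contains(piece)) { current = piece; break; }
                end--;
            }

            // As in BERT's reference tokenizer, an unmatchable word maps to [UNK].
            if (current == null) return new List<string> { unkToken };

            output.Add(current);
            start = end;
        }

        return output;
    }
}
```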