
Is TorchText implemented?

Open przemyslawbak opened this issue 1 year ago • 3 comments

The TorchSharp text classification example uses TorchText to load the data set.

I am not sure what I am doing wrong, but I cannot find any using directive that imports this library.

For the TorchSharp MNIST example, I managed to find and install the proper NuGet package to use torchvision.

Is TorchText implemented for .NET?

If not, how can I load data from a CSV file instead? I do not know what data type should be used for var reader in the example. I'm confused.

przemyslawbak avatar Oct 12 '24 08:10 przemyslawbak

I don't think we have torchtext support currently. I found the class used by the example in Examples.Utils.
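
For the CSV part of the question, here is a minimal sketch that uses only the .NET base class library. It is not the reader from Examples.Utils: the CsvTextDataset class and its Load method are hypothetical names, and the file layout (a header row, an integer label in the first column, the text in the rest of the line) is an assumption that may not match your data.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of TorchSharp or its examples.
public static class CsvTextDataset
{
    // Assumes one sample per line with the integer label in the first column.
    // Everything after the first comma is treated as the text, so commas inside
    // the text survive. Quoted labels, escaped quotes and embedded newlines are
    // NOT handled; use a real CSV parser for those cases.
    public static List<(int Label, string Text)> Load(string path, bool hasHeader = true)
    {
        var samples = new List<(int Label, string Text)>();
        foreach (var line in File.ReadLines(path).Skip(hasHeader ? 1 : 0))
        {
            if (string.IsNullOrWhiteSpace(line)) continue;   // skip blank lines
            int comma = line.IndexOf(',');
            if (comma < 0) continue;                         // skip rows without a text column
            int label = int.Parse(line.Substring(0, comma).Trim(), CultureInfo.InvariantCulture);
            string text = line.Substring(comma + 1).Trim().Trim('"');
            samples.Add((label, text));
        }
        return samples;
    }

    public static void Main()
    {
        // Write a tiny sample file so the sketch runs on its own.
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, new[] { "label,text", "3,\"Stocks rally, again\"", "1,Plain headline" });

        foreach (var (label, text) in Load(path))
            Console.WriteLine($"{label}: {text}");

        File.Delete(path);
    }
}
```

The returned list can then be batched, tokenized, and converted to tensors in whatever way the rest of the training loop expects.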

yueyinqiu avatar Oct 12 '24 14:10 yueyinqiu

We do not have that implemented.

Maybe @luisquintanilla can comment on some of the text-based preprocessing primitives we've added to ML.NET. There are a few new tokenizers there, which should be usable with TorchSharp.

NiklasGustafsson avatar Oct 15 '24 16:10 NiklasGustafsson

@LittleLittleCloud

Could you share your view on which of the recent ML.NET advances in deep NLP could be relevant for advancing a TorchText project using TorchSharp?

References

  • https://github.com/dotnet/TorchSharp/discussions/1340
  • https://github.com/dotnet/TorchSharp/discussions/1103
  • https://github.com/dotnet/TorchSharp/discussions/590
  • https://github.com/dotnet/TorchSharp/discussions/610

TorchText from PyTorch

PyTorch TorchText

  • torchtext.nn
  • torchtext.data.functional
  • torchtext.data.metrics
  • torchtext.data.utils
  • torchtext.datasets
  • torchtext.vocab
  • torchtext.utils
  • torchtext.transforms
  • torchtext.functional
  • torchtext.models

Tutorials


Tokenizers/Transforms from PyTorch

https://pytorch.org/text/stable/transforms.html

Tokenizers

  • [ ] SentencePieceTokenizer
  • [ ] GPT2BPETokenizer
  • [ ] CLIPTokenizer
  • [ ] RegexTokenizer
  • [ ] BERTTokenizer
  • [ ] CharBPETokenizer

Transform

  • VocabTransform
  • PadTransform
  • StrToIntTransform

Utils

  • ToTensor
  • LabelToIndex
  • Truncate
  • AddToken
  • Sequential
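
To make the scope of the transforms listed above concrete, here is a rough sketch of what VocabTransform, Truncate and PadTransform do, written with plain .NET collections. Nothing in it is an existing TorchSharp API: the TextTransforms class and its method names are hypothetical, and reserving indices 0 and 1 for the padding and unknown tokens is an assumption made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helpers that mirror the behaviour of a few torchtext transforms.
public static class TextTransforms
{
    // Builds a token -> index map. Indices 0 and 1 are reserved for the
    // <pad> and <unk> tokens (an assumption made for this sketch).
    public static Dictionary<string, long> BuildVocab(IEnumerable<string[]> tokenized)
    {
        var vocab = new Dictionary<string, long> { ["<pad>"] = 0, ["<unk>"] = 1 };
        foreach (var tokens in tokenized)
            foreach (var token in tokens)
                if (!vocab.ContainsKey(token)) vocab[token] = vocab.Count;
        return vocab;
    }

    // Like VocabTransform: maps tokens to indices, using <unk> for unseen tokens.
    public static long[] ToIds(string[] tokens, IReadOnlyDictionary<string, long> vocab) =>
        tokens.Select(t => vocab.TryGetValue(t, out var id) ? id : vocab["<unk>"]).ToArray();

    // Like Truncate: keeps at most maxLength ids.
    public static long[] Truncate(long[] ids, int maxLength) => ids.Take(maxLength).ToArray();

    // Like PadTransform: right-pads with padValue up to the given length.
    public static long[] Pad(long[] ids, int length, long padValue = 0)
    {
        if (ids.Length >= length) return ids;
        var padded = new long[length];
        Array.Copy(ids, padded, ids.Length);
        for (int i = ids.Length; i < length; i++) padded[i] = padValue;
        return padded;
    }

    public static void Main()
    {
        var vocab = BuildVocab(new[] { new[] { "the", "cat", "sat" } });
        var ids = ToIds(new[] { "the", "dog", "sat" }, vocab);   // "dog" is not in the vocabulary
        var fixedLength = Pad(Truncate(ids, 5), 5);
        Console.WriteLine(string.Join(", ", fixedLength));        // 2, 1, 4, 0, 0
    }
}
```

Fixed-length long arrays like these are the shape of data a batch tensor is usually built from, which is why Truncate and Pad tend to be chained.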


Microsoft.ML.Tokenizers

  • Microsoft.ML.Tokenizers
  • Microsoft.ML.Tokenizers.Data.Cl100kBase
  • Microsoft.ML.Tokenizers.Data.Gpt2
  • Microsoft.ML.Tokenizers.Data.O200kBase
  • Microsoft.ML.Tokenizers.Data.P50kBase
  • Microsoft.ML.Tokenizers.Data.R50kBase

# Microsoft.ML.Tokenizers

Models

  • BPETokenizer.cs
  • BertTokenizer.cs
  • CodeGenTokenizer.cs
  • EnglishRobertaTokenizer.cs
  • LlamaTokenizer.cs
  • Phi2Tokenizer.cs
  • SentencePieceTokenizer.cs
  • TiktokenTokenizer.cs
  • WordPieceTokenizer.cs

  • Merge.cs
  • ModelSourceGenerationContext.cs
  • Pair.cs
  • Symbol.cs
  • Word.cs
  • Cache.cs

Normalizers

  • BertNormalizer.cs
  • LowerCaseNormalizer.cs
  • Normalizer.cs
  • SentencePieceNormalizer.cs
  • UpperCaseNormalizer.cs

PreTokenizers

  • PreTokenizer.cs
  • RegexPreTokenizer.cs
  • RobertaPreTokenizer.cs
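
As an illustration of how one of the tokenizers listed above might feed TorchSharp, here is a short sketch. It assumes the TorchSharp package (plus a libtorch backend package such as TorchSharp-cpu), Microsoft.ML.Tokenizers, and Microsoft.ML.Tokenizers.Data.Cl100kBase are installed. The model name "gpt-4" and the sample sentence are picked only for the example, and the exact API surface may differ between package versions.

```csharp
using System;
using System.Linq;
using Microsoft.ML.Tokenizers;
using TorchSharp;

// Assumption: "gpt-4" resolves to the cl100k_base encoding, whose vocabulary
// ships in the Microsoft.ML.Tokenizers.Data.Cl100kBase package.
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4");

// Encode a sentence to token ids (returned as ints).
var ids = tokenizer.EncodeToIds("TorchSharp brings PyTorch to .NET");

// Widen to long so the resulting tensor has dtype int64, the usual index
// type for embedding lookups.
long[] idArray = ids.Select(id => (long)id).ToArray();
torch.Tensor input = torch.tensor(idArray);

Console.WriteLine($"tokens: {idArray.Length}, tensor shape: [{string.Join(", ", input.shape)}]");
```

The tokenizer and TorchSharp are independent here: the only contact point is an array of ids, so any of the other tokenizers could be swapped in without touching the tensor code.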

GeorgeS2019 avatar Oct 27 '24 07:10 GeorgeS2019