Is TorchText implemented?
The TorchSharp text classification example uses TorchText to load the dataset.
I am not sure what I am doing wrong, but I cannot find any using directive that imports this library.
For the TorchSharp MNIST example, I managed to find and install the proper NuGet package to use torchvision.
Is TorchText implemented for .NET?
If not, how can I load data from a CSV file instead? I do not know what data type should be used for var reader in the example. I'm confused.
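For the CSV part of the question, here is a minimal sketch that uses only the .NET base class library, so it does not depend on TorchText. It assumes a hypothetical two-column layout, "label,text", with a header row and no line breaks inside fields. The ParseCsv helper and the sample rows are illustrative only and are not part of TorchSharp or its examples.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Parse lines of the assumed form "label,text": the first comma separates
// the integer label from the text, so commas inside the text are preserved.
List<(int Label, string Text)> ParseCsv(IEnumerable<string> lines, bool hasHeader = true)
{
    var rows = new List<(int Label, string Text)>();
    foreach (var line in lines.Skip(hasHeader ? 1 : 0))
    {
        if (string.IsNullOrWhiteSpace(line)) continue;   // ignore blank lines
        int comma = line.IndexOf(',');
        if (comma < 0) continue;                         // skip malformed rows
        int label = int.Parse(line.Substring(0, comma).Trim().Trim('"'),
                              CultureInfo.InvariantCulture);
        string text = line.Substring(comma + 1).Trim().Trim('"');
        rows.Add((label, text));
    }
    return rows;
}

// In-memory sample; for a real file, pass System.IO.File.ReadLines(path) instead.
var sample = new[]
{
    "label,text",
    "1,\"Stocks rally as markets open\"",
    "2,\"Home team wins the final, again\"",
};

var data = ParseCsv(sample);
foreach (var (label, text) in data)
    Console.WriteLine($"{label}: {text}");
```

A dedicated CSV parser is the safer choice for files with escaped quotes or multi-line fields; the split-on-first-comma approach above only covers the simple layout assumed here.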
I think we don't have torchtext support currently, and I've found the class in Examples.Utils.
We do not have that implemented.
Maybe @luisquintanilla can comment on some of the text-based preprocessing primitives we've added to ML.NET -- there's a few new tokenizers there, which should be usable with TorchSharp.
@LittleLittleCloud
Could you share your view on which recent ML.NET advances in deep NLP could be relevant for advancing a TorchText project based on TorchSharp?
References
- https://github.com/dotnet/TorchSharp/discussions/1340
- https://github.com/dotnet/TorchSharp/discussions/1103
- https://github.com/dotnet/TorchSharp/discussions/590
- https://github.com/dotnet/TorchSharp/discussions/610
TorchText from PyTorch
PyTorch TorchText
- torchtext.nn
- torchtext.data.functional
- torchtext.data.metrics
- torchtext.data.utils
- torchtext.datasets
- torchtext.vocab
- torchtext.utils
- torchtext.transforms
- torchtext.functional
- torchtext.models
Tutorials
- Text classification with XLM-RoBERTa model
- T5-Base Model for Summarization, Sentiment Classification, and Translation
Tokenizers/Transforms from PyTorch
https://pytorch.org/text/stable/transforms.html
Tokenizers
- [ ] SentencePieceTokenizer
- [ ] GPT2BPETokenizer
- [ ] CLIPTokenizer
- [ ] RegexTokenizer
- [ ] BERTTokenizer
- [ ] CharBPETokenizer
Transform
- VocabTransform
- PadTransform
- StrToIntTransform
Utils
- ToTensor
- LabelToIndex
- Truncate
- AddToken
- Sequential
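To make the role of these transforms concrete, here is a rough sketch of a vocabulary lookup, truncation, and padding step written in plain C#. It only mirrors the idea behind VocabTransform, Truncate, and PadTransform and is not a port of the TorchText classes. The whitespace tokenizer, the reserved ids 0 and 1, and the helper names BuildVocab, Tokenize, and Encode are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Reserved ids (an assumption of this sketch, not a TorchText convention).
const long PadId = 0;
const long UnkId = 1;

// Naive whitespace tokenizer; a real tokenizer would replace this step.
string[] Tokenize(string text) =>
    text.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries);

// Assign consecutive ids to tokens in order of first appearance.
Dictionary<string, long> BuildVocab(IEnumerable<string> texts)
{
    var vocab = new Dictionary<string, long> { ["<pad>"] = PadId, ["<unk>"] = UnkId };
    foreach (var token in texts.SelectMany(t => Tokenize(t)))
        if (!vocab.ContainsKey(token))
            vocab[token] = vocab.Count;
    return vocab;
}

// Map tokens to ids, then truncate and pad to a fixed length.
long[] Encode(string text, Dictionary<string, long> vocab, int maxLength)
{
    var ids = Tokenize(text)
        .Select(t => vocab.TryGetValue(t, out var id) ? id : UnkId) // vocabulary lookup
        .Take(maxLength)                                            // truncate
        .ToList();
    while (ids.Count < maxLength) ids.Add(PadId);                   // pad
    return ids.ToArray();
}

var corpus = new[] { "the cat sat", "the dog barked loudly" };
var vocabulary = BuildVocab(corpus);
long[] encoded = Encode("the bird sat", vocabulary, maxLength: 5);
Console.WriteLine(string.Join(", ", encoded)); // prints: 2, 1, 4, 0, 0
```

The resulting long[] has a fixed length, so a batch of such arrays can be stacked into a tensor. Handing it to TorchSharp (for example through torch.tensor) is assumed rather than shown, so that the sketch runs without any NuGet packages.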
Microsoft.ML.Tokenizers
- Microsoft.ML.Tokenizers
- Microsoft.ML.Tokenizers.Data.Cl100kBase
- Microsoft.ML.Tokenizers.Data.Gpt2
- Microsoft.ML.Tokenizers.Data.O200kBase
- Microsoft.ML.Tokenizers.Data.P50kBase
- Microsoft.ML.Tokenizers.Data.R50kBase
Models
- BPETokenizer.cs
- BertTokenizer.cs
- CodeGenTokenizer.cs
- EnglishRobertaTokenizer.cs
- LlamaTokenizer.cs
- Phi2Tokenizer.cs
- SentencePieceTokenizer.cs
- TiktokenTokenizer.cs
- WordPieceTokenizer.cs
- Merge.cs
- ModelSourceGenerationContext.cs
- Pair.cs
- Symbol.cs
- Word.cs
- Cache.cs
Normalizers
- BertNormalizer.cs
- LowerCaseNormalizer.cs
- Normalizer.cs
- SentencePieceNormalizer.cs
- UpperCaseNormalizer.cs
PreTokenizers
- PreTokenizer.cs
- RegexPreTokenizer.cs
- RobertaPreTokenizer.cs