Verify word embedding model downloader
An internal user reported a stall during `.Fit()` of the word embedding transform.
On first use, the word embedding transform downloads its pretrained model file from the CDN.
To test:
- Clear any copies of the fastText300D word embedding file from the local machine
  - Check the local folder, and `~/.local/share/mlnet-resources/WordVectors/`, for a file named `wiki.en.vec`
- Create example code using the FastTextWikipedia300D (6.6GB) model in the word embedding transform
- Time how long it takes to download (or fail); a minimal sketch of these steps is below
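A minimal sketch of the clear-and-time steps, assuming the cache lives under `Environment.SpecialFolder.LocalApplicationData` (which maps to `~/.local/share/` on Linux/macOS; the exact layout is inferred from the folder named above):

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Default ML.NET resource cache; LocalApplicationData maps to ~/.local/share
// on Linux/macOS (layout inferred from the folder named above).
var modelFile = Path.Combine(
    Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData),
    "mlnet-resources", "WordVectors", "wiki.en.vec");

// Step 1: clear any cached copy so .Fit() must re-download the 6.6GB model
if (File.Exists(modelFile))
    File.Delete(modelFile);

// Step 3: time the download, which happens inside .Fit() of the pipeline
// built in the example code below
var stopwatch = Stopwatch.StartNew();
// var model = trainingPipeline.Fit(trainingData);
stopwatch.Stop();
Console.WriteLine($"Fit (incl. model download) took {stopwatch.Elapsed}");
```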
Example code:
```csharp
var featurizeTextOptions = new TextFeaturizingEstimator.Options()
{
    // Produce cleaned tokens for input to the word embedding transform
    OutputTokensColumnName = "OutputTokens",

    // Text cleaning (not shown is stop word removal)
    KeepDiacritics = true,   // Non-default
    KeepPunctuations = false,
    KeepNumbers = false,     // Non-default
    CaseMode = TextNormalizingEstimator.CaseMode.Lower,

    // Row-wise normalization (see: NormalizeLpNorm)
    Norm = TextFeaturizingEstimator.NormFunction.L2,

    // Use ML.NET's built-in stop word remover (non-default)
    StopWordsRemoverOptions = new StopWordsRemovingEstimator.Options()
    {
        Language = TextFeaturizingEstimator.Language.English
    },

    // ngram options
    WordFeatureExtractor = new WordBagEstimator.Options()
    {
        NgramLength = 2,
        UseAllLengths = true, // Produce both unigrams and bigrams
        Weighting = NgramExtractingEstimator.WeightingCriteria.Tf, // Can also use TF-IDF or IDF
    },

    // chargram options
    CharFeatureExtractor = new WordBagEstimator.Options()
    {
        NgramLength = 3,
        UseAllLengths = false, // Produce only tri-chargrams and not single/double characters
        Weighting = NgramExtractingEstimator.WeightingCriteria.Tf, // Can also use TF-IDF or IDF
    },
};

// Featurization pipeline
var pipeline = mlContext.Transforms.Conversion.MapValueToKey("Label", "Label") // Needed for multi-class to convert string labels to the Key type
    // Create ngrams, and cleaned tokens for the word embedding
    .Append(mlContext.Transforms.Text.FeaturizeText("FeaturesText", featurizeTextOptions, new[] { "InputText" })) // Use above options object
    // Word embedding transform reads in the cleaned tokens from the text featurizer
    .Append(mlContext.Transforms.Text.ApplyWordEmbedding("FeaturesWordEmbedding",
        "OutputTokens", WordEmbeddingEstimator.PretrainedModelKind.FastTextWikipedia300D))
    // Feature vector is the concatenation of the ngrams from the text transform, and the word embeddings
    .Append(mlContext.Transforms.Concatenate("Features", new[] { "FeaturesText", "FeaturesWordEmbedding" }))
    // Enable if numeric features are also included. Normalization is generally
    // unneeded if only using the output from FeaturizeText as it's row-wise
    // normalized w/ an L2-norm; word embeddings are also well behaved.
    //.Append(mlContext.Transforms.NormalizeMinMax("Features", "Features"))
    // Cache the featurized dataset in memory for added speed
    .AppendCacheCheckpoint(mlContext);

// Trainer
var trainer = mlContext.MulticlassClassification.Trainers.OneVersusAll(
        mlContext.BinaryClassification.Trainers.AveragedPerceptron(
            labelColumnName: "Label", numberOfIterations: 10, featureColumnName: "Features"),
        labelColumnName: "Label")
    .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel", "PredictedLabel"));

var trainingPipeline = pipeline.Append(trainer);
```
The code here shows a full example of `FeaturizeText` used with `ApplyWordEmbedding`. Specifically, it creates the tokens for `ApplyWordEmbedding` by removing numbers, keeping diacritics, and lowercasing, to match how the fastText model was trained. The text cleaning reduces the out-of-vocabulary (OOV) issue in the word embedding. For any specific dataset, these options can be tested.
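For reference, a minimal sketch of fitting the pipeline above on in-memory data; the `ModelInput` class and its sample rows are hypothetical, and only need `Label` and `InputText` columns to match the names used above:

```csharp
using Microsoft.ML;

var mlContext = new MLContext();

// Tiny hypothetical dataset; real repro should use a realistic corpus
var trainingData = mlContext.Data.LoadFromEnumerable(new[]
{
    new ModelInput { Label = "positive", InputText = "An enjoyable, well paced film." },
    new ModelInput { Label = "negative", InputText = "Dull and far too long." },
});

// The first Fit() triggers the model download; this is where the stall was reported
var model = trainingPipeline.Fit(trainingData);

// Hypothetical input schema; property names match the columns used above
public class ModelInput
{
    public string Label { get; set; }
    public string InputText { get; set; }
}
```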
Side note:
We should make a sample of `FeaturizeText` with `ApplyWordEmbedding`. I wrote the above since I couldn't locate one to link to in this issue.
Additional user report: https://github.com/dotnet/machinelearning/issues/5450#issuecomment-714930905
I want to work on this. Can anyone help me?
Hello! How can we go about using other-language embeddings in place of FastTextWikipedia300D? I mean, if I use wiki.LangPrefix.vec with a language that isn't in ML.NET's enums, the `.Fit()` method just never finishes.
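A minimal sketch of one possible workaround, assuming a manually downloaded vector file: `ApplyWordEmbedding` also has an overload that accepts a path to a custom model file, so a non-English `wiki.xx.vec` can be loaded without a `PretrainedModelKind` enum value (the path below is a placeholder):

```csharp
// Custom-model overload: loads a local word2vec-format file directly instead
// of a PretrainedModelKind enum value. "path/to/wiki.xx.vec" is a placeholder.
var embedding = mlContext.Transforms.Text.ApplyWordEmbedding(
    "FeaturesWordEmbedding",
    customModelFile: @"path/to/wiki.xx.vec",
    inputColumnName: "OutputTokens");
```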