machinelearning
Help with custom VarVector to Vector mapping (does it have to be so hard?)
System information
- OS version/distro: Windows 10
- .NET Version (eg., dotnet --info): ML.NET v4.0.30319
Issue
- What did you do?
- I have a data class that contains single fields and fields that are arrays
- What happened?
- I could not work out how to custom-map the generated varvectors into a single vector
- similar to https://github.com/dotnet/machinelearning/issues/4977
- What did you expect?
- Easy way to convert the varvectors into a single vector
Source code / logs
// Code abbreviated
MLContext mlContext = new MLContext();
SchemaDefinition definedSchema;
// For fixed array dimension sizes use
definedSchema = SchemaDefinition.Create(typeof(MLDataForAnalysisFactored));
// If we can use variable array dimension sizes we can use this
int featuresCount = SetFeatureArrayDimensions(mlDataList, out definedSchema);
IDataView trainDataView = mlContext.Data.LoadFromEnumerable<MLDataForAnalysisFactored>(mlDataList, definedSchema);
// The label must not be in the input features
string[] features = typeof(MLDataForAnalysisFactored).GetProperties()
    .Where(p => p.Name != labelColumnName)
    .Select(p => p.Name)
    .ToArray();
var estimatorChain = mlContext.Transforms.Conversion
    .MapValueToKey(outputColumnName: "Label", inputColumnName: nameof(MLDataForAnalysisFactored.Call))
    .Append(mlContext.Transforms.Concatenate("Features", features));
var transformedTrainData = estimatorChain.Fit(trainDataView);
trainDataView = transformedTrainData.Transform(trainDataView);
// check the data
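// reuseRowObject must be false when the rows are materialized with ToList();
// with true, every list entry refers to the same reused object.
var rowEnumerable = mlContext.Data
    .CreateEnumerable<MLDataForAnalysisFactored>(trainDataView,
        reuseRowObject: false).ToList();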
var schema = trainDataView.Schema;
//var featureColumns = trainDataView.GetColumn<float[]>(trainDataView.Schema["Features"]).Take(4);
//create a custom schema-definition that overrides the type for the Values field...
var singleVectorSchemaDef = SchemaDefinition.Create(typeof(SingleVector));
singleVectorSchemaDef[nameof(SingleVector.Values)].ColumnType
= new VectorDataViewType(NumberDataViewType.Single, featuresCount);
// How to do a custom transform to map the varvectors to a single vector?
// otherwise you have to do this.
////use that schema definition when creating the training dataview
//trainDataView = mlContext.Data.LoadFromEnumerable(mlDataList, singleVectorSchemaDef);
// check the data
// var someRows = mlContext.Data. // Convert to an enumerable of user-defined type.
// .CreateEnumerable<MLDataForAnalysisFactored>(trainDataView, reuseRowObject: false)
//// Take a couple values as an array.
//.Take(4).ToArray();
// Extract the 'AllFeatures' column.
// This will give the entire dataset: make sure to only take several rows
// in case the dataset is huge. This is similar to the static API, except
// you have to specify the column name and type.
//var featureColumns = estimatorChain.GetColumn<float[]>(trainDataView.Schema["Features"]);
// STEP 2: Run AutoML experiment
Console.WriteLine($"Running AutoML multiclass classification experiment for {ExperimentTime} seconds...");
ExperimentResult<MulticlassClassificationMetrics> experimentResult = mlContext.Auto()
.CreateMulticlassClassificationExperiment(ExperimentTime)
.Execute(trainDataView, LabelColumnName, null, estimatorChain);
where SingleVector is
public class SingleVector
{
//it's not required to specify the type here since we will override in our custom schema
public float[] Values;
}
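For the manual route hinted at by the commented-out `LoadFromEnumerable` call, one option is to flatten each row into a single `float[]` before handing it to ML.NET. The block below is a minimal sketch of that idea, not an ML.NET API: `DemoRow` and `RowFlattener` are hypothetical names, and it assumes every feature is either a `float` or a `float[]` property.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical row type used only for this illustration.
public class DemoRow
{
    public string Call { get; set; }      // label, excluded from the features
    public float Price { get; set; }      // scalar feature
    public float[] History { get; set; }  // array feature
}

public static class RowFlattener
{
    // Hypothetical helper (not an ML.NET API): copies every float and float[]
    // property except the label into one float[], in the order that
    // GetProperties() returns them.
    public static float[] Flatten<T>(T row, string labelPropertyName)
    {
        var values = new List<float>();
        foreach (var property in typeof(T).GetProperties())
        {
            if (property.Name == labelPropertyName)
                continue;

            object value = property.GetValue(row);
            if (value is float scalar)
                values.Add(scalar);
            else if (value is float[] array)
                values.AddRange(array);
            // Properties of any other type are skipped in this sketch.
        }
        return values.ToArray();
    }
}
```

The result could then populate `SingleVector.Values`, for example `new SingleVector { Values = RowFlattener.Flatten(row, nameof(MLDataForAnalysisFactored.Call)) }`. Two caveats: the label would still have to travel with the row (say, in an extra field on `SingleVector`), and the .NET documentation does not guarantee the order in which `GetProperties()` returns properties, so a fixed, explicit property list is safer if column order matters.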
// Above function SetFeatureArrayDimensions
// You would expect that once the size of the var vectors has been set then that is all you should have to do but ML.NET needs
// a single vector.
private int SetFeatureArrayDimensions(List<MLDataForAnalysisFactored> mlDataList, out SchemaDefinition definedSchema)
{
int totalFeatureCount = 0;
// STEP 1: Load data
// The feature dimension (typically this will be the Count of the array
// of the features vector known at runtime).
int featureArrayDimension = 0;
definedSchema = SchemaDefinition.Create(typeof(MLDataForAnalysisFactored));
var properties = typeof(MLDataForAnalysisFactored).GetProperties();
foreach (var property in properties)
{
if (property.PropertyType.IsArray)
{
Array array = property.GetValue(mlDataList[0]) as Array;
featureArrayDimension = array.Length;
totalFeatureCount += featureArrayDimension;
//// Set the column type to be a known-size vector.
var vectorItemType = ((VectorDataViewType)definedSchema[property.Name].ColumnType)
.ItemType;
definedSchema[property.Name].ColumnType = new VectorDataViewType(vectorItemType,
featureArrayDimension);
var featureColumn = definedSchema[property.Name]
.ColumnType as VectorDataViewType;
Diag.Debug.WriteLine($"Is the size of the Feature array {property.Name} column known: " +
$"{featureColumn.IsKnownSize}.\nSize: {featureColumn.Size}");
}
else
{
// Note: this branch also counts the label if it is a non-array property of the class.
totalFeatureCount++;
}
}
return (totalFeatureCount);
}
Anyway I can get away with using fixed array sizes.
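For the fixed-size case, my understanding is that `Concatenate` already produces one known-size vector when every input column is a scalar or a known-size vector of the same item type, so no custom mapping should be needed. The sketch below illustrates that under stated assumptions: `FixedRow` and `ConcatDemo` are hypothetical names, and the size is declared with the `[VectorType]` attribute instead of a `SchemaDefinition` override.

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical row type: one scalar and one array with a declared size of 3.
public class FixedRow
{
    public float Scalar { get; set; }

    [VectorType(3)]
    public float[] Series { get; set; }
}

public static class ConcatDemo
{
    // Loads two rows, concatenates both columns into "Features",
    // and returns the resulting column type for inspection.
    public static VectorDataViewType GetFeaturesType()
    {
        var mlContext = new MLContext();
        var rows = new[]
        {
            new FixedRow { Scalar = 1f, Series = new[] { 2f, 3f, 4f } },
            new FixedRow { Scalar = 5f, Series = new[] { 6f, 7f, 8f } },
        };

        IDataView data = mlContext.Data.LoadFromEnumerable(rows);
        IDataView transformed = mlContext.Transforms
            .Concatenate("Features", nameof(FixedRow.Scalar), nameof(FixedRow.Series))
            .Fit(data)
            .Transform(data);

        return (VectorDataViewType)transformed.Schema["Features"].Type;
    }
}
```

Checking `IsKnownSize` and `Size` on the returned type shows whether the concatenated column ended up fixed-size. If any input column is still a variable-size vector, the output is variable-size too, which is why each array column needs its size declared (by attribute or by `SchemaDefinition`) before the concatenation.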