
Help with Custom VarVector to Vector mapping (does it have to be so hard?)

Open · acrigney opened this issue 3 years ago · 0 comments

System information

  • OS version/distro: Windows 10
  • .NET Version (e.g., dotnet --info): ML.NET v4.0.30319

Issue

  • What did you do?
  • I have a data class that contains single-value fields and fields that are arrays.
  • What happened?
  • I need to custom-map the generated VarVector columns into a single vector,
  • similar to https://github.com/dotnet/machinelearning/issues/4977.
  • What did you expect?
  • An easy way to convert the VarVector columns into a single vector (see the sketch below).
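
For reference, here is a minimal, untested sketch of the kind of mapping I mean, using the CustomMapping transform plus a SchemaDefinition that fixes the output vector size at runtime. The class names ExampleInput/ExampleOutput, the field layout, and the flattening lambda are invented for illustration and are not taken from my project below.

using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Input row with one scalar field and one variable-length array field (illustrative only).
public class ExampleInput
{
    public float Age;
    public float[] Readings;   // no [VectorType] attribute, so it is inferred as a variable-size vector
}

// Output row holding one flat feature vector whose size is only known at runtime.
public class ExampleOutput
{
    public float[] Features;
}

public static class VarVectorFlattenSketch
{
    public static void Run()
    {
        var mlContext = new MLContext();

        var data = new[]
        {
            new ExampleInput { Age = 30, Readings = new float[] { 1f, 2f, 3f } },
            new ExampleInput { Age = 40, Readings = new float[] { 4f, 5f, 6f } },
        };

        // Every row must produce the same total length, so take it from the first row.
        int featureCount = 1 + data[0].Readings.Length;

        // Override the output schema so ExampleOutput.Features is a known-size vector.
        var outputSchema = SchemaDefinition.Create(typeof(ExampleOutput));
        outputSchema[nameof(ExampleOutput.Features)].ColumnType =
            new VectorDataViewType(NumberDataViewType.Single, featureCount);

        IDataView dataView = mlContext.Data.LoadFromEnumerable(data);

        // CustomMapping copies the scalar and the array elements into one fixed-size float[].
        var flatten = mlContext.Transforms.CustomMapping<ExampleInput, ExampleOutput>(
            (src, dst) => dst.Features = new[] { src.Age }.Concat(src.Readings).ToArray(),
            contractName: null,
            outputSchemaDefinition: outputSchema);

        IDataView flattened = flatten.Fit(dataView).Transform(dataView);

        // "Features" is now a known-size vector of length 4, suitable as a trainer input.
        Console.WriteLine(((VectorDataViewType)flattened.Schema["Features"].Type).Size);
    }
}

(Note that with contractName: null the fitted model cannot be saved; registering a CustomMappingFactory and passing a real contract name lifts that restriction.)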

Source code / logs

// Code abbreviated

MLContext mlContext = new MLContext();

SchemaDefinition definedSchema;            

// For fixed array dimension sizes use
definedSchema = SchemaDefinition.Create(typeof(MLDataForAnalysisFactored));

// If we can use variable array dimension sizes we can use this
int featuresCount = SetFeatureArrayDimensions(mlDataList, out definedSchema);

IDataView trainDataView = mlContext.Data.LoadFromEnumerable<MLDataForAnalysisFactored>(mlDataList, definedSchema);

// The label must not be in the input features
string[] features = typeof(MLDataForAnalysisFactored).GetProperties().ToList().
        Where(p => p.Name != labelColumnName).Select(x => x.Name).ToArray();

var estimatorChain = mlContext.Transforms.Conversion.MapValueToKey(outputColumnName: "Label", inputColumnName: 
    nameof(MLDataForAnalysisFactored.Call))
    .Append(mlContext.Transforms.Concatenate("Features", features));

var transformedTrainData = estimatorChain.Fit(trainDataView);

trainDataView = transformedTrainData.Transform(trainDataView);

// Check the data. Note: reuseRowObject must be false when materializing with ToList(),
// otherwise every list element would reference the same reused row object.
var rowEnumerable = mlContext.Data
    .CreateEnumerable<MLDataForAnalysisFactored>(trainDataView,
    reuseRowObject: false).ToList();

var schema = trainDataView.Schema;

//var featureColumns = trainDataView.GetColumn<float[]>(trainDataView.Schema["Features"]).Take(4);

// Create a custom schema definition that overrides the type of the Values field.
var singleVectorSchemaDef = SchemaDefinition.Create(typeof(SingleVector));
singleVectorSchemaDef[nameof(SingleVector.Values)].ColumnType
              = new VectorDataViewType(NumberDataViewType.Single, featuresCount);

// How do we do a custom transform to map the VarVectors to a single vector?
// Otherwise you have to reload the data with the single-vector schema, like this:
////use that schema definition when creating the training dataview  
//trainDataView = mlContext.Data.LoadFromEnumerable(mlDataList, singleVectorSchemaDef);
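// A possible alternative (untested sketch, not verified against this project): apply a
// CustomMapping transform that copies each MLDataForAnalysisFactored row into
// SingleVector.Values, passing singleVectorSchemaDef as the output schema so "Values"
// comes out as a known-size vector of length featuresCount. FlattenRow is a hypothetical
// helper that would copy the scalar fields and every array element into one float[].
//var flattenEstimator = mlContext.Transforms.CustomMapping<MLDataForAnalysisFactored, SingleVector>(
//    (src, dst) => dst.Values = FlattenRow(src),
//    contractName: null,
//    outputSchemaDefinition: singleVectorSchemaDef);
//trainDataView = flattenEstimator.Fit(trainDataView).Transform(trainDataView);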

// check the data

//        var someRows = mlContext.Data // Convert to an enumerable of user-defined type.
//                .CreateEnumerable<MLDataForAnalysisFactored>(trainDataView, reuseRowObject: false)
//// Take a couple of values as an array.
//.Take(4).ToArray();

// Extract the 'Features' column.
// This will give the entire dataset: make sure to only take several rows
// in case the dataset is huge. This is similar to the static API, except
// you have to specify the column name and type.

//var featureColumns = trainDataView.GetColumn<float[]>(trainDataView.Schema["Features"]);

// STEP 2: Run AutoML experiment
Console.WriteLine($"Running AutoML multiclass classification experiment for {ExperimentTime} seconds...");            

ExperimentResult<MulticlassClassificationMetrics> experimentResult = mlContext.Auto()
    .CreateMulticlassClassificationExperiment(ExperimentTime)
    .Execute(trainDataView, LabelColumnName, null, estimatorChain);

where SingleVector is

  public class SingleVector
  {
      //it's not required to specify the type here since we will override in our custom schema 
      public float[] Values;
  }
  // Above function SetFeatureArrayDimensions.
  // You would expect that once the size of the VarVectors has been set that is all you
  // should have to do, but ML.NET trainers need a single known-size Features vector.
  private int SetFeatureArrayDimensions(List<MLDataForAnalysisFactored> mlDataList, out SchemaDefinition definedSchema)
      {
          int totalFeatureCount = 0;
          // STEP 1: Load data
  
          // The feature dimension (typically this will be the Count of the array 
          // of the features vector known at runtime).
          int featureArrayDimension = 0;
          definedSchema = SchemaDefinition.Create(typeof(MLDataForAnalysisFactored));
  
          var properties = typeof(MLDataForAnalysisFactored).GetProperties();
  
          foreach (var property in properties)
          {
              if (property.PropertyType.IsArray)
              {
                  Array array = property.GetValue(mlDataList[0]) as Array;
                  featureArrayDimension = array.Length;
                  totalFeatureCount += featureArrayDimension;
  
                  //// Set the column type to be a known-size vector.
                  var vectorItemType = ((VectorDataViewType)definedSchema[property.Name].ColumnType)
                              .ItemType;
  
                  definedSchema[property.Name].ColumnType = new VectorDataViewType(vectorItemType,
                      featureArrayDimension);
  
                  var featureColumn = definedSchema[property.Name]
                  .ColumnType as VectorDataViewType;
  
                  Diag.Debug.WriteLine($"Is the size of the Feature array {property.Name} column known: " +
                      $"{featureColumn.IsKnownSize}.\nSize: {featureColumn.Size}");
              }
              else
              {
                  totalFeatureCount++;
              }
          }
          return (totalFeatureCount);
      }        

Anyway, I can get away with using fixed array sizes (see the sketch below).
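
For completeness, a minimal sketch of that fixed-size route, assuming array lengths that are known at compile time; the class and field names here are invented for illustration. With [VectorType(n)] on the array fields no SchemaDefinition override is needed, and Concatenate can build the Features column directly.

using Microsoft.ML;
using Microsoft.ML.Data;

// Illustrative input class with compile-time-fixed array sizes (not my real class).
public class FixedSizeRow
{
    public string Call;          // label

    public float Age;            // scalar feature

    [VectorType(3)]              // size declared at compile time, so this is a known-size vector
    public float[] Readings;
}

public static class FixedSizeSketch
{
    public static void Run()
    {
        var mlContext = new MLContext();

        var rows = new[]
        {
            new FixedSizeRow { Call = "A", Age = 30, Readings = new float[] { 1f, 2f, 3f } },
            new FixedSizeRow { Call = "B", Age = 40, Readings = new float[] { 4f, 5f, 6f } },
        };

        IDataView dataView = mlContext.Data.LoadFromEnumerable(rows);

        // With known-size vector columns, Concatenate can build the Features column directly.
        var pipeline = mlContext.Transforms.Conversion.MapValueToKey("Label", nameof(FixedSizeRow.Call))
            .Append(mlContext.Transforms.Concatenate("Features",
                nameof(FixedSizeRow.Age), nameof(FixedSizeRow.Readings)));

        IDataView transformed = pipeline.Fit(dataView).Transform(dataView);
        // transformed.Schema["Features"] is now a known-size vector of length 4.
    }
}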

acrigney · Aug 05 '22 08:08