Long Column names are unexpectedly dropped in training

Open torronen opened this issue 3 years ago • 3 comments

System Information (please complete the following information):

  • OS & Version: Windows 11
  • ML.NET Version: ML.NET 1.6.0
  • .NET Version: .NET 6.0

Describe the bug

A dataset may include long column names. In my case, they are about 150 characters long. The names contain only the characters a-z, A-Z, 0-9 and the dash. ColumnInference reports them correctly. However, after training is started with AutoML, these columns are not used for training. If the dataset contains only long column names, an exception about a missing "Features" column is thrown.

To Reproduce

Steps to reproduce the behavior:

  1. Create a dataset with long column names (numeric columns in my case)
  2. Column inference reports them correctly: ColumnInferenceResults columnInference = mlContext.Auto().InferColumns(TrainDataPath, LabelColumnName, groupColumns: false);
  3. Train: experimentResult = experiment.Execute(TrainDataView, ValidationDataView, columnInformation, null, progressHandler);
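  4. Observe the exception about the missing "Features" column.
  5. Rename the columns to shorter names, either manually or in a loop, to confirm that training now works. This can also be used as a workaround for now:
// Inside a loop over the columns: copy the long-named column `col` to a short name.
var copyPipeline = mlContext.Transforms.CopyColumns("col" + i, col.Name);
// Fit and apply the copy so the data view gains the short-named column.
OriginalTrainDataView = copyPipeline.Fit(OriginalTrainDataView).Transform(OriginalTrainDataView);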

Note: I have tree-based algorithms enabled.

Expected behavior

Columns with long names should be used for training normally. If that is not possible, an exception should be thrown. As it stands, a user might think all of the data is being used for training when some columns are actually being ignored.

It is possible that the verbose log level would give information about this, but it is disabled by default in AutoML. I did not run a separate test with verbose output.

Additional data

There may be many reasons why a dataset could include long column names. For example, a column name may include the name, ID and settings of a measurement device.

If possible, I'd like to know what the column name length limit currently is, even if this gets fixed. That would help me work out which fields were ignored in earlier models.

torronen avatar Jan 12 '22 17:01 torronen

Off the top of my head, I'm not sure what the column name length limit is. I can look into it.

How long are the names when you notice this behavior?

michaelgsharp avatar Jan 18 '22 18:01 michaelgsharp

About 150 to 200 characters long.

torronen avatar Jan 18 '22 18:01 torronen

Related: on the Python side, in LightGBM, I received the error "Do not support non-ASCII characters in feature name." On closer inspection, I saw that one of the column titles has í instead of i. I did not test it, but it is possible that such columns are also silently dropped. Maybe some of the trainers would give a warning about it if verbosity were higher? I'll update here if I have time to test. EDIT: It appears the column with í has not been dropped, although I could not verify whether it is actually being used for training.

torronen avatar Jan 25 '22 14:01 torronen

@torronen thanks for reporting this. Are you still running into this issue?

Since this relates to the previous implementation of AutoML as well as an older version of ML.NET, I'm inclined to close it for now, but if you're still seeing the issue I'll leave it open.

Thanks

luisquintanilla avatar Oct 07 '22 16:10 luisquintanilla

@luisquintanilla I am no longer using long column names, but I can retry when I upgrade the code for this dataset to the new AutoML implementation, which I have not done yet. We can close this, and I can re-open it if I or someone else notices the problem again.

torronen avatar Oct 08 '22 17:10 torronen