machinelearning
DataFrame.LoadCsv cannot load CSV with duplicate column names
Code: IDataView trainData = DataFrame.LoadCsv(TrainDatasetPath, separator: ';', header: true, guessRows: 100);
Gives exception: DataFrame already contains a column called Target20 (Parameter 'column')
Suggestion: It would be nice if LoadCsv had an option to ignore or auto-rename duplicate columns. For small CSV files this is not a big problem, but for huge CSV files renaming headers by hand is a hassle.
If anyone has the same problem: the renamed header names can be passed in the columnNames parameter. This solves my issue. I do not know whether LoadCsv should have this functionality built in (or whether this issue should be closed).
LoadCsv with renamed columns:
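using System;
using System.IO;
using System.Linq;
using Microsoft.Data.Analysis;
using Microsoft.ML;

// Read only the header line and split it on the same separator passed to LoadCsv.
string line1 = File.ReadLines(TrainDatasetPath).First();
string[] arr = line1.Split(';');
// Count the occurrences of every name that appears more than once.
var duplicatedItems = arr.GroupBy(a => a)
    .Where(g => g.Count() > 1)
    .ToDictionary(g => g.Key, g => g.Count());
// Walk backwards so the last duplicate gets the highest suffix and the first keeps its name.
for (int i = arr.Length - 1; i >= 0; i--)
{
    string item = arr[i];
    if (!duplicatedItems.ContainsKey(item))
    {
        continue; // unique name, nothing to rename
    }
    if (duplicatedItems[item] > 1)
    {
        arr[i] = String.Format("{0}_{1}", item, duplicatedItems[item]);
    }
    duplicatedItems[item]--;
}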
IDataView trainData = DataFrame.LoadCsv(TrainDatasetPath, separator: ';', header: true, columnNames: arr, guessRows: 100);
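One caveat with the snippet above: a generated name such as Target20_2 could itself clash with a column that already exists in the header. Below is a minimal sketch of a reusable variant that guards against this. HeaderNames.MakeUnique is a hypothetical helper written for illustration, not part of Microsoft.Data.Analysis, and the "_N" suffix format is simply carried over from the workaround.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of Microsoft.Data.Analysis.
public static class HeaderNames
{
    // Returns a copy of `names` in which every repeated name gets a "_N" suffix.
    // The first occurrence keeps its original name.
    public static string[] MakeUnique(IReadOnlyList<string> names)
    {
        // Every name that must not be reused: all original names plus
        // every suffixed name generated so far.
        var taken = new HashSet<string>(names);
        // Names whose first occurrence has already been emitted.
        var seen = new HashSet<string>();
        var result = new string[names.Count];

        for (int i = 0; i < names.Count; i++)
        {
            string name = names[i];
            if (seen.Add(name))
            {
                result[i] = name; // first occurrence, keep as is
                continue;
            }

            // Repeated name: try name_2, name_3, ... until one is free.
            int suffix = 2;
            string candidate = $"{name}_{suffix}";
            while (!taken.Add(candidate))
            {
                candidate = $"{name}_{++suffix}";
            }
            result[i] = candidate;
        }
        return result;
    }
}
```

For example, HeaderNames.MakeUnique(new[] { "Target20", "Id", "Target20" }) yields Target20, Id, Target20_2. The result can be passed to columnNames in the same way as arr above. Note that both versions split the header with a plain Split(';'), so they assume the header contains no quoted fields with embedded separators.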
@luisquintanilla @torronen is this something we think should be built in to load? I can see it going both ways honestly. I would probably lean towards having it built in somehow.
@michaelgsharp by built in, do you mean duplicate columns are automatically renamed like in the snippet @torronen shared?
+1 - I also experienced pain around large data files with multiple columns sharing names. I had fewer than 50 columns and it was still troublesome to deal with.