DataFrame.LoadCsv cannot load CSV with duplicate column names

Open torronen opened this issue 2 years ago • 4 comments

Code: IDataView trainData = DataFrame.LoadCsv(TrainDatasetPath, separator: ';', header: true, guessRows: 100);

Gives exception: DataFrame already contains a column called Target20 (Parameter 'column')

Suggestion: It would be nice if LoadCsv had an option to ignore or auto-rename duplicate columns. For small CSV files this is not a big problem, but for huge CSV files renaming the headers by hand is a hassle.

torronen avatar May 03 '22 09:05 torronen

If anyone has the same problem, the renamed header names can be passed in the columnNames parameter. This solves my issue. I do not know whether LoadCsv should have this functionality built in (or whether this issue should be closed).

LoadCsv with renamed columns:

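    // Requires: using System; using System.IO; using System.Linq;
    //           using Microsoft.Data.Analysis; using Microsoft.ML;
    string line1 = File.ReadLines(TrainDatasetPath).First();
    string[] arr = line1.Split(';');

    // Count the occurrences of every header name that appears more than once.
    var duplicatedItems = arr.GroupBy(a => a)
                             .Where(g => g.Count() > 1)
                             .ToDictionary(g => g.Key, g => g.Count());

    // Walk backwards so the first occurrence keeps its name
    // and later occurrences become name_2, name_3, ...
    for (int i = arr.Length - 1; i >= 0; i--)
    {
        string item = arr[i];
        if (!duplicatedItems.ContainsKey(item))
        {
            continue; // unique name, keep as-is
        }

        if (duplicatedItems[item] > 1)
        {
            arr[i] = String.Format("{0}_{1}", item, duplicatedItems[item]);
        }

        duplicatedItems[item]--;
    }

    IDataView trainData = DataFrame.LoadCsv(TrainDatasetPath, separator: ';', header: true, columnNames: arr, guessRows: 100);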
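The same renaming can also be pulled out into a small reusable helper. The following is only a sketch: `HeaderNames` and `MakeUnique` are hypothetical names, not part of Microsoft.Data.Analysis. It assumes a forward pass is acceptable, and it additionally skips any suffix that would collide with a name already present in the header (for example a file that contains both a duplicated `Target20` and a real `Target20_2`).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Microsoft.Data.Analysis.
public static class HeaderNames
{
    // Returns a copy of 'names' in which every repeated name gets a numeric
    // suffix (_2, _3, ...). The first occurrence of each name is kept as-is.
    public static string[] MakeUnique(IReadOnlyList<string> names)
    {
        // Every name that is taken: all original names plus generated ones.
        var used = new HashSet<string>(names);
        // Original names that have already been emitted once.
        var seen = new HashSet<string>();
        // Last suffix handed out for each duplicated name.
        var lastSuffix = new Dictionary<string, int>();
        var result = new string[names.Count];

        for (int i = 0; i < names.Count; i++)
        {
            string name = names[i];
            if (seen.Add(name))
            {
                result[i] = name; // first occurrence keeps its name
                continue;
            }

            // Repeated name: try name_2, name_3, ... until one is free.
            int n = lastSuffix.TryGetValue(name, out int last) ? last + 1 : 2;
            string candidate = name + "_" + n;
            while (!used.Add(candidate))
            {
                n++;
                candidate = name + "_" + n;
            }

            lastSuffix[name] = n;
            result[i] = candidate;
        }

        return result;
    }
}
```

With this in place, the header line can be split as in the snippet above, passed through `HeaderNames.MakeUnique`, and the result handed to `columnNames`. Note that a plain `Split(';')` does not handle quoted header fields that themselves contain the separator, so this is only suitable for simple headers.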

torronen avatar May 03 '22 10:05 torronen

@luisquintanilla @torronen is this something we think should be built in to load? I can see it going both ways honestly. I would probably lean towards having it built in somehow.

michaelgsharp avatar May 09 '22 17:05 michaelgsharp

@michaelgsharp by built in, do you mean that duplicate columns are automatically renamed, like in the snippet @torronen shared?

luisquintanilla avatar May 09 '22 18:05 luisquintanilla

+1 - I also experienced pain around large data files with multiple columns sharing names. I had fewer than 50 columns and it was still troublesome to deal with.

beccamc avatar May 09 '22 21:05 beccamc