
[DataFrame] can't handle separators in data

Open terrajobst opened this issue 5 years ago • 2 comments

Related to dotnet/corefxlab#2968

Looks like DataFrame can't handle CSV files where the separator appears inside a quoted column value.

Repro

using System;
using Microsoft.Data.Analysis;

// fileName is the path to the CSV file shown below.
var frame = DataFrame.LoadCsv(fileName);

foreach (var row in frame.Rows)
{
    Console.WriteLine(row[0]);
    Console.WriteLine(row[1]);
    Console.WriteLine(row[2]);
    Console.WriteLine();
}

CSV contents:

Name,Age,Description
Paul,34,"Paul lives in Vermont, VA."
Victor,29,"Victor: Funny guy"
Maria,31,

Expected behavior

Prints the contents of the CSV
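
To make the expected behavior concrete, here is a minimal sketch of quote-aware field splitting in the style of RFC 4180. It is not how DataFrame parses CSV; `CsvLineSplitter` and `SplitLine` are hypothetical names used only for illustration, and the sketch assumes each record fits on a single line (no embedded newlines inside quoted fields).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only; not part of Microsoft.Data.Analysis.
public static class CsvLineSplitter
{
    // Splits one CSV line into fields, treating separators inside
    // double-quoted fields as data rather than as field boundaries.
    public static List<string> SplitLine(string line, char separator = ',')
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c == '"')
                {
                    // A doubled quote inside a quoted field is an escaped quote.
                    if (i + 1 < line.Length && line[i + 1] == '"')
                    {
                        current.Append('"');
                        i++;
                    }
                    else
                    {
                        inQuotes = false; // closing quote
                    }
                }
                else
                {
                    current.Append(c); // separators are kept as data here
                }
            }
            else if (c == '"')
            {
                inQuotes = true; // opening quote
            }
            else if (c == separator)
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        // The last field has no trailing separator; it may be empty.
        fields.Add(current.ToString());
        return fields;
    }
}
```

Under these assumptions, the row `Paul,34,"Paul lives in Vermont, VA."` splits into three fields, with the comma after "Vermont" kept inside the third one, and `Maria,31,` splits into three fields with an empty third field.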

Actual behavior

Exception:

Unhandled exception. System.FormatException: Line 2 has less columns than expected
   at Microsoft.Data.Analysis.DataFrame.GuessKind(Int32 col, List`1 read)
   at Microsoft.Data.Analysis.DataFrame.LoadCsv(Stream csvStream, Char separator, Boolean header, String[] columnNames, Type[] dataTypes, Int64 numberOfRowsToRead, Int32 guessRows, Boolean addIndexColumn, Encoding encoding)
   at Microsoft.Data.Analysis.DataFrame.LoadCsv(String filename, Char separator, Boolean header, String[] columnNames, Type[] dataTypes, Int32 numRows, Int32 guessRows, Boolean addIndexColumn, Encoding encoding)
   at ConsoleApp49.Program.Main(String[] args)

terrajobst, Sep 11 '20 15:09

Maybe CSV parsing should get its own library in the BCL, similar to JSON and XML? It is an extremely common need. CsvHelper is a great (and very popular) library, but obviously you can't reference it in your packages.

MgSam, Sep 16 '20 15:09

Related to dotnet/corefxlab#2787

luisquintanilla, Sep 23 '20 21:09

Unable to repro as of version 0.20.0-preview.22313.1

luisquintanilla, Aug 23 '22 18:08

As Luis mentioned, this issue appears to have been solved at some point. Regardless, I'm adding unit tests that confirm this behavior here: https://github.com/dotnet/machinelearning/pull/6301.

Edit: I closed that PR and included those tests in a larger PR linked below.

dakersnar, Aug 23 '22 22:08