machinelearning
[DataFrame] can't handle separators in data
Related to dotnet/corefxlab#2968
It looks like DataFrame.LoadCsv can't handle a CSV file in which the separator appears inside a quoted field.
Repro
using System;
using Microsoft.Data.Analysis;

// fileName is the path to the CSV file shown below.
var frame = DataFrame.LoadCsv(fileName);
foreach (var row in frame.Rows)
{
    Console.WriteLine(row[0]);
    Console.WriteLine(row[1]);
    Console.WriteLine(row[2]);
    Console.WriteLine();
}
CSV contents:
Name,Age,Description
Paul,34,"Paul lives in Vermont, VA."
Victor,29,"Victor: Funny guy"
Maria,31,
Expected behavior
Prints the contents of the CSV
Actual behavior
Exception:
Unhandled exception. System.FormatException: Line 2 has less columns than expected
at Microsoft.Data.Analysis.DataFrame.GuessKind(Int32 col, List`1 read)
at Microsoft.Data.Analysis.DataFrame.LoadCsv(Stream csvStream, Char separator, Boolean header, String[] columnNames, Type[] dataTypes, Int64 numberOfRowsToRead, Int32 guessRows, Boolean addIndexColumn, Encoding encoding)
at Microsoft.Data.Analysis.DataFrame.LoadCsv(String filename, Char separator, Boolean header, String[] columnNames, Type[] dataTypes, Int32 numRows, Int32 guessRows, Boolean addIndexColumn, Encoding encoding)
at ConsoleApp49.Program.Main(String[] args)
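The expected behavior depends on a quoted field being read as a single column even when it contains the separator. As a rough illustration of that rule, here is a minimal sketch of quote-aware splitting for one line. It is not the DataFrame implementation; `SplitCsvLine` is a hypothetical helper, and it assumes double-quote quoting with `""` as the escape and no fields spanning multiple lines.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper (not part of Microsoft.Data.Analysis):
// splits one CSV line into fields, treating separators inside
// double quotes as literal text.
static List<string> SplitCsvLine(string line, char separator = ',')
{
    var fields = new List<string>();
    var current = new StringBuilder();
    bool inQuotes = false;

    for (int i = 0; i < line.Length; i++)
    {
        char c = line[i];
        if (inQuotes)
        {
            if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
            {
                current.Append('"'); // "" inside a quoted field is an escaped quote
                i++;
            }
            else if (c == '"')
            {
                inQuotes = false;    // closing quote
            }
            else
            {
                current.Append(c);   // separators are plain text while quoted
            }
        }
        else if (c == '"')
        {
            inQuotes = true;         // opening quote
        }
        else if (c == separator)
        {
            fields.Add(current.ToString()); // unquoted separator ends the field
            current.Clear();
        }
        else
        {
            current.Append(c);
        }
    }

    fields.Add(current.ToString()); // last field (may be empty, as in "Maria,31,")
    return fields;
}

// The rows from the CSV in this issue.
var lines = new[]
{
    "Paul,34,\"Paul lives in Vermont, VA.\"",
    "Victor,29,\"Victor: Funny guy\"",
    "Maria,31,",
};

foreach (var line in lines)
{
    var fields = SplitCsvLine(line);
    Console.WriteLine($"{fields.Count} fields: {string.Join(" | ", fields)}");
}
```

With this rule each of the three data rows yields three fields, which is what the header row declares.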
Maybe CSV parsing should get its own library in the BCL, similar to JSON and XML? It is an extremely common need. CsvHelper is a great (and very popular) library, but obviously you guys cannot reference it in your packages.
Related to dotnet/corefxlab#2787
Unable to repro as of version 0.20.0-preview.22313.1
As Luis mentioned, this issue appears to have been solved at some point. Regardless, I'm adding unit tests that confirm this behavior here: https://github.com/dotnet/machinelearning/pull/6301.
Edit: I closed that PR and included those tests in a larger PR linked below.
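For anyone who wants to re-check this locally, a sketch along the following lines should do. It is not the test code from the PR; it assumes the Microsoft.Data.Analysis package is referenced and uses the Stream overload of LoadCsv that appears in the stack trace above, relying on default values for every parameter after the stream.

```csharp
using System;
using System.IO;
using System.Text;
using Microsoft.Data.Analysis;

// Same CSV contents as in the original report.
string csv =
    "Name,Age,Description\n" +
    "Paul,34,\"Paul lives in Vermont, VA.\"\n" +
    "Victor,29,\"Victor: Funny guy\"\n" +
    "Maria,31,\n";

// Load from an in-memory stream so no file on disk is needed.
using var stream = new MemoryStream(Encoding.UTF8.GetBytes(csv));
DataFrame frame = DataFrame.LoadCsv(stream);

// If quoted separators are handled, there are 3 columns and 3 rows,
// and the first Description keeps its comma.
Console.WriteLine(frame.Columns.Count);
Console.WriteLine(frame.Rows.Count);
Console.WriteLine(frame.Rows[0][2]);
```

On a version with the bug, the LoadCsv call is where the FormatException from the report would be thrown.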