machinelearning icon indicating copy to clipboard operation
machinelearning copied to clipboard

[DataFrame] LoadCsv throws IndexOutOfRangeException

Open luisquintanilla opened this issue 5 years ago • 5 comments
trafficstars

Using the following dataset: https://www.kaggle.com/austinreese/craigslist-carstrucks-data

The dataset itself is large (1.4 GB) and sparse.

Using the following code:

var df = DataFrame.LoadCsv("vehicles.csv");

Throws the following exception:

System.IndexOutOfRangeException: 'Index was outside the bounds of the array.'

Here is the stacktrace:

   at Microsoft.Data.Analysis.DataFrame.GetColumnName(String[] columnNames, Int32 columnIndex)
   at Microsoft.Data.Analysis.DataFrame.CreateColumn(Type kind, String[] columnNames, Int32 columnIndex)
   at Microsoft.Data.Analysis.DataFrame.LoadCsv(Stream csvStream, Char separator, Boolean header, String[] columnNames, Type[] dataTypes, Int64 numberOfRowsToRead, Int32 guessRows, Boolean addIndexColumn, Encoding encoding)
   at Microsoft.Data.Analysis.DataFrame.LoadCsv(String filename, Char separator, Boolean header, String[] columnNames, Type[] dataTypes, Int32 numRows, Int32 guessRows, Boolean addIndexColumn, Encoding encoding)
   at ConsoleApp.Program.Main(String[] args) in C:\Users\luquinta.REDMOND\Development\UsedCarCraigslistSample\ConsoleApp\Program.cs:line 16

luisquintanilla avatar May 06 '20 06:05 luisquintanilla

Using similar functionality in Pandas does not throw these errors and the dataset loads successfully.

image

VehiclesEDA.zip

luisquintanilla avatar May 06 '20 07:05 luisquintanilla

I looked at this locally and found the issue. The description column has many ,s inside it, so we're not able to parse the number of columns correctly. Instead of detecting 25 columns, we detect 150+ columns. Let me explore adding a new IO csproj to help with LoadCsv bugs such as this one

pgovind avatar May 06 '20 18:05 pgovind

Great! Thanks @pgovind . This is a tricky dataset. I tried using LINQ and IO but ran into a similar issue so I understand the challenges. At least it's a way to see all the edge cases 🙂

luisquintanilla avatar May 06 '20 18:05 luisquintanilla

Make sure only use '\t'. ',' does not work properly. If you have a string stored in a cell as "string1, string2" and you happen to use ',' as the separator, Microsoft.Data.Analysis will interpret it as two cells. This is why I have to check and convert all my comma-separated txts to tab-separated with Excel first.

liugaocn avatar May 22 '20 18:05 liugaocn

Related to #5647

luisquintanilla avatar Mar 30 '22 23:03 luisquintanilla

I can't comment on #5647, but I'm seeing several folks say this issue is resolved as of a past version and want to share that it just happened to me last night.

I attempted to combine csvs that all have the same column headers using the below code and got the below error. I ended up having to use Pandas. But I would prefer to be able to immerse myself in c# more so I can get better using it. If I don't specify a separator, the data will load into a dataframe, but it will be misaligned in many places. Pandas handled it without issue.

If it's just me doing something incorrectly or not doing something I should be I would appreciate being pointed in the correct direction. It's difficult to find solid documentation on working with dataframes in c# like this.

EDIT: Updating my below code I can get it to read into a dataframe without issue. But I cannot save it to a csv without the issue persisting using DataFrame.SaveCsv(). I tried specifying a separator but it still results in misaligned data in the output file.

Context: using a Polyglot notebook Windows 11 Microsoft.Data.Analysis version: 0.21.0-preview.22621.2 .NET Version: 7.0.201

//Building one data frame from a folder of CSV files
string folder_path = @"<path/to/folder>";
var files = new DirectoryInfo(folder_path).GetFiles("*.*",SearchOption.TopDirectoryOnly);

var df = DataFrame.LoadCsv(files[0].FullName);

foreach (var file in files.Skip(0))
{
    var tempDF = DataFrame.LoadCsv(file.FullName);
    df = df.Append(tempDF.Rows,inPlace:true);
}
DataFrame.SaveCsv(df,@"path\to\csv")
Error: System.FormatException: Line 5 has less columns than expected
at Microsoft.Data.Analysis.DataFrame.GuessKind(Int32 col, List`1 read)
at Microsoft.Data.Analysis.DataFrame.ReadCsvLinesIntoDataFrame(WrappedStreamReaderOrStringReader wrappedReader, Char separator, Boolean header, String[] columnNames, Type[] dataTypes, Int64 numberOfRowsToRead, Int32 guessRows, Boolean addIndexColumn)
at Microsoft.Data.Analysis.DataFrame.LoadCsv(Stream csvStream, Char separator, Boolean header, String[] columnNames, Type[] dataTypes, Int64 numberOfRowsToRead, Int32 guessRows, Boolean addIndexColumn, Encoding encoding)
at Microsoft.Data.Analysis.DataFrame.LoadCsv(String filename, Char separator, Boolean header, String[] columnNames, Type[] dataTypes, Int32 numRows, Int32 guessRows, Boolean addIndexColumn, Encoding encoding)
at Submission#6.<<Initialize>>d__0.MoveNext()
--- End of stack trace from previous location ---
at Microsoft.CodeAnalysis.Scripting.ScriptExecutionState.RunSubmissionsAsync[TResult](ImmutableArray`1 precedingExecutors, Func`2 currentExecutor, StrongBox`1 exceptionHolderOpt, Func`2 catchExceptionOpt, CancellationToken cancellationToken)

R0315 avatar Mar 15 '23 22:03 R0315