Allow developers to supply their own function to infer column data types from data while loading CSVs

Open sevenzees opened this issue 1 year ago • 5 comments

Fixes #7141

Currently, when you call LoadCsv or LoadCsvFromString without supplying a data type for each column, the code guesses the types from the data in the CSV file. This is good, but the default type inference code only considers bool, float, DateTime, and string as column types. Sometimes the user needs another data type, such as int, long, or double (see issue 6347 for an example where the float data type chosen by default caused a problem), but does not know the structure of the data ahead of time.

I would like to be able to pass in my own type inference logic to override the default GuessKind implementation that the library uses right now. If no custom type-guessing function is provided to LoadCsv or LoadCsvFromString, the code should work the same as it does today.
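To make the request concrete, here is a minimal sketch of what such a custom inference function could look like. It is an illustration only: the class and method names (`CsvTypeInference`, `GuessColumnType`) are hypothetical, and the assumed shape of the hook, a function that receives a column's sample values as strings and returns a `Type`, is a guess, since this thread does not show the actual parameter name or delegate signature.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical helper; not part of Microsoft.Data.Analysis.
public static class CsvTypeInference
{
    // Picks the narrowest type that parses every non-empty sample value,
    // widening int -> long -> double before trying bool and DateTime,
    // and falling back to string when nothing else fits.
    public static Type GuessColumnType(IEnumerable<string> samples)
    {
        // Ignore empty cells so missing values do not force a string column.
        var values = samples.Where(s => !string.IsNullOrWhiteSpace(s)).ToList();
        if (values.Count == 0)
        {
            return typeof(string);
        }

        var inv = CultureInfo.InvariantCulture;

        if (values.All(v => int.TryParse(v, NumberStyles.Integer, inv, out _)))
            return typeof(int);
        if (values.All(v => long.TryParse(v, NumberStyles.Integer, inv, out _)))
            return typeof(long);
        if (values.All(v => double.TryParse(v, NumberStyles.Float, inv, out _)))
            return typeof(double);
        if (values.All(v => bool.TryParse(v, out _)))
            return typeof(bool);
        if (values.All(v => DateTime.TryParse(v, inv, DateTimeStyles.None, out _)))
            return typeof(DateTime);

        return typeof(string);
    }
}
```

A function like this would be handed to LoadCsv or LoadCsvFromString in place of the default guess; when it is not supplied, the existing behavior would stay unchanged. Trying the integer types before double is a deliberate choice here, so that a column of whole numbers is not widened to a floating-point type.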

sevenzees avatar Apr 26 '24 21:04 sevenzees

@dotnet-policy-service agree

sevenzees avatar Apr 26 '24 21:04 sevenzees

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 68.57%. Comparing base (72cfdf6) to head (b6cd225).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7142      +/-   ##
==========================================
+ Coverage   68.55%   68.57%   +0.01%     
==========================================
  Files        1259     1259              
  Lines      255844   255969     +125     
  Branches    26434    26452      +18     
==========================================
+ Hits       175392   175518     +126     
- Misses      73717    73718       +1     
+ Partials     6735     6733       -2     
| Flag | Coverage Δ |
|---|---|
| Debug | 68.57% <100.00%> (+0.01%) :arrow_up: |
| production | 62.90% <100.00%> (+<0.01%) :arrow_up: |
| test | 88.72% <100.00%> (+0.02%) :arrow_up: |

Flags with carried forward coverage won't be shown.

| Files | Coverage Δ |
|---|---|
| src/Microsoft.Data.Analysis/DataFrame.IO.cs | 83.50% <100.00%> (+0.38%) :arrow_up: |
| ...Microsoft.Data.Analysis.Tests/DataFrame.IOTests.cs | 99.13% <100.00%> (+0.09%) :arrow_up: |

... and 4 files with indirect coverage changes

codecov[bot] avatar Apr 26 '24 23:04 codecov[bot]

@JakeRadMSFT @luisquintanilla can I get you 2 to take a look at this please?

michaelgsharp avatar May 08 '24 19:05 michaelgsharp

Is anyone looking at this?

sevenzees avatar May 29 '24 01:05 sevenzees

@sevenzees I can take a look later today.

@JakeRadMSFT @luisquintanilla can you 2 please take a look at this?

michaelgsharp avatar May 29 '24 17:05 michaelgsharp