[feature request] Add numpy.genfromtxt()

Open mw66 opened this issue 1 year ago • 32 comments

https://forum.dlang.org/thread/[email protected]

Hi,

I'm just wondering: what is the best way to read a CSV data file into a Mir (2D array) ndslice? Especially if it can parse dates into int/float.

I searched a bit but couldn't find any example.

Thanks.

So, can we add numpy.genfromtxt():

https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html

Especially with column converters?

mw66 avatar Sep 21 '22 15:09 mw66

https://github.com/Kriyszig/magpie

BTW, found this for reference:

from_csv(string path, int indexDepth = 1, int columnDepth = 1,int[] columns = [], char sep = ',')

mw66 avatar Sep 21 '22 16:09 mw66

For Mir to catch up with numpy, being able to easily read CSV to import data is a must to attract data scientists.

In numpy/pandas, it's just a one-liner.

mw66 avatar Sep 21 '22 20:09 mw66

mir-ion can deserialize Mir and common 2D arrays from binary Ion, text Ion, JSON, and MsgPack. Plus, YAML support is coming soon.

We just need to add CSV there.

void serializeFromCsv(S)(scope ref S serializer, scope const(char)[] csv, /+ other params +/)
{
    /// on the fly:
    ///  - parse CSV
    ///  - recognise cell patterns: integers, floats, timestamps, true, false, null; all others are strings
    ///  - serializer.putValue(value)
}
  • The serialiser interface is here: https://github.com/libmir/mir-ion/blob/master/source/mir/ser/interfaces.d#L18
  • [Timestamp](http://mir-algorithm.libmir.org/mir_timestamp.html).fromYamlString and Timestamp.fromString provide the recognition API.
  • For long and double recognition, use fromString (a sketch of the recognition order follows this list).
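For illustration only, a minimal sketch of that recognition step, assuming the boolean-returning overloads of mir.parse.fromString and Timestamp.fromString referenced above; the null/string serializer calls are placeholders, not the actual mir.ser interface:

void putCell(S)(scope ref S serializer, scope const(char)[] cell)
{
    import mir.parse: fromString;
    import mir.timestamp: Timestamp;

    long integer;
    double floating;
    Timestamp timestamp;

    if (cell.length == 0)
        serializer.putNull(); // placeholder: empty cell between two separators
    else if (cell == "true" || cell == "True") // plus friends, following YAML conversion
        serializer.putValue(true);
    else if (cell == "false" || cell == "False")
        serializer.putValue(false);
    else if (fromString(cell, integer))
        serializer.putValue(integer);
    else if (fromString(cell, floating))
        serializer.putValue(floating);
    else if (Timestamp.fromString(cell, timestamp))
        serializer.putValue(timestamp);
    else
        serializer.putValue(cell); // placeholder: all other cells are strings
}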

If we have serializeFromCsv implemented, I will do the rest. The API will look like:

auto matrix = csvText.deserializeCsv!(Slice!(double*, 2));

9il avatar Sep 22 '22 03:09 9il

I don't really have a good sense of how to use mir.ion. I see that it references a cookbook, but it would be good to have a simple example written in D on the readme.md.

jmh530 avatar Sep 22 '22 10:09 jmh530

auto matrix = deserializeCsv!(Slice!(double*, 2));

Am I correct that the deserializeCsv would handle the transformation of what is read from a .csv file into the Slice type? It wouldn't handle the reading of the file itself.

jmh530 avatar Sep 22 '22 14:09 jmh530

The fixed version should look like:

auto matrix = csvText.deserializeCsv!(Slice!(double*, 2));

deserializeCsv will call serializeFromCsv to serialize CSV to a binary Ion DOM (it is super fast), and then it will call deserializeValue, which is already implemented.

Mir always splits deserialization logic into two stages: first, data to binary Ion; second, binary Ion to value.

The idea is that binary Ion to value works fast and is unified across all input formats.
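A conceptual sketch of that split, for illustration only (ionBuffer and .data are hypothetical placeholders for the binary Ion serializer; serializeFromCsv and deserializeValue are the pieces discussed above):

T deserializeCsv(T)(scope const(char)[] csvText)
{
    // Stage 1 (format-specific): CSV text -> binary Ion data.
    auto serializer = ionBuffer(); // hypothetical binary Ion serializer factory
    serializeFromCsv(serializer, csvText);

    // Stage 2 (shared by all input formats): binary Ion -> T.
    T value;
    deserializeValue(serializer.data, value); // .data: hypothetical accessor for the Ion bytes
    return value;
}

Reading the file itself stays with the caller, e.g. std.file.readText("data.csv") piped into deserializeCsv.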

9il avatar Sep 23 '22 03:09 9il

Assume row-major notation.

Then CSV should have the following options to be converted to Ion:

  • matrix
  • an array of records with inner keys in the first row
  • a record of arrays with outer keys in the first column
  • a record of records with inner keys in the first row and outer keys in the first column

These four conversion kinds allow converting CSV to Ion on the fly; an illustration with a small CSV follows the next list.

Also, a simple transposition with full memory allocation allows four more conversions:

  • transposed matrix
  • an array of records with inner keys in the first column
  • a record of arrays with outer keys in the first row
  • a record of records with inner keys in the first column and outer keys in the first row
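To make the four on-the-fly kinds concrete, here is how a small CSV

name,open,close
a,1.0,2.0
b,3.0,4.0

would map to Ion-like text under each of them (illustrative only; the exact Ion output may differ):

  • matrix: [["name", "open", "close"], ["a", 1.0, 2.0], ["b", 3.0, 4.0]]
  • array of records, inner keys in the first row: [{name: "a", open: 1.0, close: 2.0}, {name: "b", open: 3.0, close: 4.0}]
  • record of arrays, outer keys in the first column: {name: ["open", "close"], a: [1.0, 2.0], b: [3.0, 4.0]}
  • record of records, both: {a: {open: 1.0, close: 2.0}, b: {open: 3.0, close: 4.0}}

The transposed variants apply the same mappings after transposing the CSV.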

We can define a CSV algebraic as follows:

module mir.algebraic_alias.csv;

import mir.algebraic;

/++
CSV tagged algebraic alias.
+/
alias CsvAlgebraic = Algebraic!Csv_;

/++
Definition union for $(LREF CsvAlgebraic).
+/
union Csv_
{
    /// Used for an empty CSV scalar, like the one between two separators: `,,`
    typeof(null) null_;
    /// Used for false, true, False, True, and friends. Follows YAML conversion
    bool boolean;
    ///
    long integer;
    ///
    double float_;
    ///
    immutable(char)[] string;
}
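For illustration only (not part of the proposal), a sketch of consuming such cells with mir.algebraic.visit, assuming the module sketched above:

unittest
{
    import mir.algebraic: visit;
    import mir.algebraic_alias.csv: CsvAlgebraic;

    CsvAlgebraic cell = 3.14;
    // Numeric cells become double; booleans, strings, and nulls fall back to NaN here.
    double x = cell.visit!(
        (double v) => v,
        (long v) => cast(double) v,
        (_) => double.nan);
    assert(x == 3.14);
}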

9il avatar Sep 23 '22 04:09 9il

Added draft: https://github.com/libmir/mir-ion/blob/master/source/mir/csv.d

9il avatar Sep 23 '22 06:09 9il

I don't know if this applies or not, but one thing that is very useful in R's read.csv [1] is the ability to identify certain strings as representing NAs. For instance, if you specify that "NA", "N/A", or "#N/A" (you could even use -999) mean NA, then the entire column gets a floating-point type even if some entries were originally strings. That way you don't need to read the column as a string and process it later.

Not sure what other features would be useful, but that's one that sticks out.

[1] https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/read.table

jmh530 avatar Sep 23 '22 12:09 jmh530

Also check numpy.genfromtxt() and pandas.read_csv().

I think handling invalid/NaN values (including empty entries, i.e. ,,) and allowing the user to pass in column converter callbacks are the two most important features.

numpy.genfromtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None, skip_header=0, skip_footer=0, converters=None, missing_values=None, filling_values=None, usecols=None, names=None, excludelist=None, deletechars=" !#$%&'()*+, -./:;<=>?@[\]^{|}~", replace_space='_', autostrip=False, case_sensitive=True, defaultfmt='f%i', unpack=None, usemask=False, loose=True, invalid_raise=True, max_rows=None, encoding='bytes', *, ndmin=0, like=None)

https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

mw66 avatar Sep 23 '22 16:09 mw66

column converter callbacks

E.g. the user can plug in their own date-string-to-int/float converter function.

mw66 avatar Sep 23 '22 22:09 mw66

@mw66 I can't see a reason why we may even need to provide an in-column converter callback while we have the power of mir.algebraic, mir.ndslice, and mir.functional. It may be confusing, but the experience of people coming from scripting languages and my own are so different that it's hard for me to see why they want to do something one way when there is a 'common' way to do it.

Maybe we could share our experiences with each other.

Let's do the following. Please provide a CSV/TSV data sample and an example in any programming language of how you handle it. Then I will provide an example of how we can handle the data in Mir. Once we do that, it will be easy to figure out a good API.

9il avatar Sep 24 '22 03:09 9il

@mw66 I have updated the draft with callback support. Please check the first unittest.

9il avatar Sep 24 '22 06:09 9il

why we may even need to provide an in-column converter callback

The reason is simple: the ndslice's element type is homogeneous, it's just double/float, but the input can be mixed with date strings, e.g.:

Date,Open,High,Low,Close,Volume
2021-01-21 09:30:00,133.8,134.43,133.59,134.0,9166695
2021-01-21 09:35:00,134.25,135.0,134.19,134.5,4632863

You need the converter at import time, to convert the date string into a double, not afterwards (how would you otherwise store strings mixed with floats in the ndslice?).

mw66 avatar Sep 24 '22 06:09 mw66

https://github.com/libmir/mir-ion/blob/master/source/mir/csv.d#L518

So, an ndslice can have mixed types for different columns? (Sorry if I'm missing something, I'm a newbie to Mir.)

mw66 avatar Sep 24 '22 07:09 mw66

So, an ndslice can have mixed types for different columns? (Sorry if I'm missing something, I'm a newbie to Mir.)

No, an ndslice can't. However, you aren't limited to an ndslice matrix of double. Five other options in Mir can do it:

  • WIP Series!(Timestamp*, double*, 2) support: mir.series (mir-algorithm) can store an index of one type and an ndslice data matrix of another type. mir-csv will be able to load the index from the first column.
  • Implemented Slice!(CsvAlgebraic*, 2) support: mir.algebraic.Algebraic (mir-core), including mir.algebraic_alias.csv (mir-ion), can store values of different types. So, you can load an ndslice matrix of algebraic cells and then process it (a sketch follows this list).
  • Implemented Tuple!(Timestamp[], double[], double[], double[], double[], double[]) support: mir.functional.Tuple (mir-core, latest release), which is used in the example, is a kind of static array of different predefined types.
  • A struct of arrays:
import mir.timestamp: Timestamp;
import mir.serde: serdeKeys;
struct Data
{
    @serdeKeys("Date")
    Timestamp[] date;

    @serdeKeys("Open")
    double[] open;
    ...
}
  • Associative arrays of columns: CsvAlgebraic[][string]
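For illustration only, a minimal sketch of the second option, assuming the draft's text.Csv wrapper used elsewhere in this thread and the implemented Slice!(CsvAlgebraic*, 2) support:

unittest
{
    import mir.algebraic_alias.csv: CsvAlgebraic;
    import mir.csv: Csv;
    import mir.ion.conv: serde;
    import mir.ndslice: Slice;

    // A 2x2 CSV with mixed cell types: long, double, string, long.
    auto text = "1,2.5\nspam,4\n";
    auto cells = text.Csv.serde!(Slice!(CsvAlgebraic*, 2));
    // Each cell is a CsvAlgebraic; process the matrix afterwards, e.g. with mir.algebraic.visit.
}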

And the callback for column conversion is implemented as you want.

9il avatar Sep 24 '22 08:09 9il

I would want to do something like this:

auto text = "1,2\n3,4\n5,#N/A\n";
auto matrix = //TODO
matrix.should == [[1.0, 2], [3.0, 4], [5.0, double.nan]];

I saw something with nulls in your UT, but I didn't think you could mix nulls with other data in an array.

With respect to column converters/classes, I think that is just a scripting language way to try to enforce types. It's telling the function how to process the text from the csv. For a scripting language it is good to have it when you need it, but if you have a type system then usually you would just tell it that it is a particular type (provided that type is supported).

I think one difficulty is that this involves a lot of different functionality that doesn't have the best documentation or examples. Even if it is incredibly powerful, more work on that front may help reduce the burden on future users.

jmh530 avatar Sep 24 '22 12:09 jmh530

I saw something with nulls in your UT, but I didn't think you could mix nulls with other data in an array.

mir.ion.conv.serde!(mir.algebraic.Nullable!double[][])(text.Csv)

9il avatar Sep 24 '22 12:09 9il

I saw something with nulls in your UT, but I didn't think you could mix nulls with other data in an array. mir.ion.conv.serde!(mir.algebraic.Nullable!double[][])(text.Csv)

@jmh530 The full example

/// Converting NA to NaN
unittest
{
    import mir.csv;
    import mir.algebraic: Nullable, visit;
    import mir.ion.conv: serde;
    import mir.ndslice: Slice, map, slice;
    import mir.ser.text: serializeText;
    import mir.test: should;

    auto text = "1,2\n3,4\n5,#N/A\n";
    auto matrix = text
        .Csv
        .serde!(Slice!(Nullable!double*, 2))
        .map!(visit!((double x) => x, (_) => double.nan))
        .slice;

    matrix.serializeText.should == q{[[1.0,2.0],[3.0,4.0],[5.0,nan]]};
}

9il avatar Sep 24 '22 12:09 9il

why we may even need to provide an in-column converter callback

The reason is simple: the ndslice's element type is homogeneous, it's just double/float, but the input can be mixed with date strings, e.g.:

Date,Open,High,Low,Close,Volume
2021-01-21 09:30:00,133.8,134.43,133.59,134.0,9166695
2021-01-21 09:35:00,134.25,135.0,134.19,134.5,4632863

You need the converter at import time, to convert the date string into a double, not afterwards (how would you otherwise store strings mixed with floats in the ndslice?).

BTW, here is the Python code; the converters argument is a dictionary keyed by column index/name:

import numpy as np
from datetime import datetime

str2date = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

data = np.genfromtxt('data.csv',dtype=None,names=True, delimiter=',', converters = {0: str2date})

mw66 avatar Sep 24 '22 16:09 mw66

For a scripting language it is good to have it when you need it, but if you have a type system then usually you would just tell it that it is a particular type (provided that type is supported).

Without converters in D, how do you handle the date string in my above example? (The other columns are floats.)

mw66 avatar Sep 24 '22 16:09 mw66

Without converters in D, how do you handle the date string in my above example?

mir.csv recognises that it is a timestamp. It does this for all input fields: it parses numbers, booleans, and timestamps.

9il avatar Sep 24 '22 16:09 9il

Without converters in D, how do you handle the date string in my above example?

mir.csv recognises that it is a timestamp. It does this for all input fields: it parses numbers, booleans, and timestamps.

Then it's a fixed format, but in real-life data you may encounter all kinds of formats, e.g.:

2022/09/24
2022.09.24
2022-09-24
...
24/09/2022
...
24-sep-2022
24-sept-2022
...
09/24/2022
09 24 2022
09/24/22
...
September/24/2022
...
The list can go on and on.

You'd better allow the user to plug in their own converters for their own data.

And this is just for dates; the data may contain all kinds of different strings that the user wants to convert to numbers in their own way.

mw66 avatar Sep 24 '22 16:09 mw66

You'd better allow the user to plug in their own converters for their own data.

It is allowed now. Please check the conversionFinalizer.

9il avatar Sep 24 '22 16:09 9il

You'd better allow the user to plug in their own converters for their own data.

It is allowed now. Please check the conversionFinalizer.

conversionFinalizer : (
            unquotedString,
            scalar,
            columnIndex,
            columnName)

So you pass out the columnIndex and columnName to let the user branch in their function and dispatch to different columns? This looks ugly and may incur code duplication, e.g. for two different CSVs with col-1 and col-2 swapped:

In Python, passing in a dict, it's two one-liners:

data1 = np.genfromtxt("data1.csv", ..., converters = {1: cvtr1, 2: cvtr2})
data2 = np.genfromtxt("data2.csv", ..., converters = {2: cvtr1, 1: cvtr2})

With conversionFinalizer in D:

conversionFinalizer1 (...) {
  switch (columnIndex) {
    case 1: return cvtr1(unquotedString);
    case 2: return cvtr2(unquotedString);
    default: return scalar;
  }
}

conversionFinalizer2 (...) {
  switch (columnIndex) {
    case 1: return cvtr2(unquotedString);
    case 2: return cvtr1(unquotedString);
    default: return scalar;
  }
}

Too verbose.

Why not use the Python dictionary format and let Mir do the branching inside the library?

mw66 avatar Sep 24 '22 17:09 mw66

It is less verbose in Python because it is a scripting language. If you think it will be less verbose in D, please give an example. But it should be a full-featured solution like the current one.

9il avatar Sep 24 '22 17:09 9il

It has nothing to do with Python being a scripting language. It's about the API of the function, and about who (the library or the user) is responsible for dispatching the column converters.

An example in D can just follow the Python API:

double cvtr1(string str) { return ...; }
double cvtr2(string str) { return ...; }

auto data1 = mir.genfromtxt("data1.csv", ..., [1: &cvtr1, 2: &cvtr2]);  // D does not have named args yet, so let's just use positional args
auto data2 = mir.genfromtxt("data2.csv", ..., [2: &cvtr1, 1: &cvtr2]);  // pass in D's AA

So why can't this code be implemented in the D library?

mw66 avatar Sep 24 '22 17:09 mw66

It can be added, but that isn't a generalised solution. We could do an additional overload like that. Note that it can be just a wrapper around the verbose solution.
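For example, a sketch of such a wrapper, for illustration only: makeFinalizer is not an existing API, and the parameter types are guesses based on the names quoted above.

import mir.algebraic_alias.csv: CsvAlgebraic;

// Builds a conversionFinalizer-style callback from an AA of per-column converters.
auto makeFinalizer(double function(string)[size_t] converters)
{
    return (string unquotedString, CsvAlgebraic scalar, size_t columnIndex, string columnName)
    {
        if (auto cvtr = columnIndex in converters)
            return CsvAlgebraic((*cvtr)(unquotedString));
        return scalar; // unconverted columns keep the recognised scalar
    };
}

Usage then mirrors the Python dict:

double function(string)[size_t] converters1 = [1: &cvtr1, 2: &cvtr2];
double function(string)[size_t] converters2 = [2: &cvtr1, 1: &cvtr2];
// pass makeFinalizer(converters1) / makeFinalizer(converters2) as the conversionFinalizer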

9il avatar Sep 24 '22 18:09 9il

  1. I don't understand what you mean by "that isn't a generalised solution"; can you elaborate? I think passing the dictionary is as general as the current conversionFinalizer.

  2. Please do the overload. As long as the dispatching is inside the library code, the user's calling code will be tidy and succinct.

mw66 avatar Sep 24 '22 18:09 mw66

  • I don't understand what you mean by "that isn't a generalised solution"; can you elaborate? I think passing the dictionary is as general as the current conversionFinalizer.

conversionFinalizer provides much more context for the user.

9il avatar Sep 24 '22 18:09 9il