mir-algorithm
[feature request] Add numpy.genfromtxt()
https://forum.dlang.org/thread/[email protected]
Hi,
I'm just wondering what is the best way to read a CSV data file into a Mir (2d array) ndslice? Esp. if it can parse dates into int/float.
I searched a bit, but can't find any example.
Thanks.
So, can we add numpy.genfromtxt():
https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html
Esp. with column converters?
https://github.com/Kriyszig/magpie
BTW, found this for the reference:
from_csv(string path, int indexDepth = 1, int columnDepth = 1, int[] columns = [], char sep = ',')
For Mir to catch up with numpy, being able to easily read CSV to import data is a must to attract data scientists.
In numpy/pandas, it's just one liner.
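For reference, the numpy one-liner looks like this (reading from an in-memory string so the snippet is self-contained; the sample values are made up):

```python
import io
import numpy as np

# Parse a small CSV into a 2-D float ndarray in one call.
csv_text = "1.5,2.5\n3.5,4.5\n"
matrix = np.genfromtxt(io.StringIO(csv_text), delimiter=",")
print(matrix.shape)  # (2, 2)
```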
mir-ion can deserialize Mir and common 2D arrays from Binary Ion, Text Ion, JSON, MsgPack. Plus YAML support is coming soon.
We just need to add CSV there.
void serializeFromCsv(S)(scope ref S serializer, scope const(char)[] csv, /+ other params +/)
{
    /// on the fly:
    /// - parse csv
    /// - recognise cell patterns: integers, floats, timestamps, true, false, null; all others are strings
    /// - serializer.putValue(value)
}
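The cell-pattern recognition step described in the comments could behave roughly like this (a Python sketch of the intended classification, not the actual mir-ion code; timestamp recognition is omitted for brevity):

```python
def recognize(cell: str):
    """Classify a CSV cell: integers, floats, booleans, null;
    everything else stays a string."""
    if cell == "":
        return None                # empty cell between two separators: `,,`
    low = cell.lower()
    if low in ("true", "false"):
        return low == "true"       # boolean pattern
    try:
        return int(cell)           # integer pattern
    except ValueError:
        pass
    try:
        return float(cell)         # float pattern
    except ValueError:
        pass
    return cell                    # fall back to string

print([recognize(c) for c in ["42", "1.5", "true", "", "hello"]])
# [42, 1.5, True, None, 'hello']
```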
- The serialiser interface is here: https://github.com/libmir/mir-ion/blob/master/source/mir/ser/interfaces.d#L18
- [Timestamp](http://mir-algorithm.libmir.org/mir_timestamp.html).fromYamlString and Timestamp.fromString have recognition APIs.
- For long and double, recognition is available via fromString.
If we have this function implemented I will do the rest. The API will look like:
auto matrix = csvText.deserializeCsv!(Slice!(2, double*));
I don't really have a good sense of how to use mir.ion. I see that it references a cookbook, but it would be good to have a simple example written in D in the readme.md.
auto matrix = deserializeCsv!(Slice!(2, double*));
Am I correct that deserializeCsv would handle the transformation of what is read from a .csv file into the Slice type? It wouldn't handle the reading of the file itself.
The fixed version should look like:
auto matrix = csvText.deserializeCsv!(Slice!(2, double*));
deserializeCsv will call serializeFromCsv to serialize CSV to binary Ion DOM (it is super fast), and then it will call deserializeValue, which is already implemented.
Mir always splits deserialization logic into two stages: first, data to binary Ion; second, binary Ion to value.
The idea is that binary ion to value works fast and unified across all types of formats.
Assume row-major notation.
Then CSV should have the following options to be converted to Ion:
- matrix
- an array of records with inner keys in the first row
- a record of arrays with outer keys in the first column
- a record of records with inner keys in the first row and outer keys in the first column
These four kinds of conversion allow converting CSV to Ion on the fly.
Also, a simple transposition with full memory allocation will allow four other conversions:
- transposed matrix
- an array of records with inner keys in the first column
- a record of arrays with outer keys in the first row
- a record of records with inner keys in the first column and outer keys in the first row
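To make the four on-the-fly conversions concrete, here is a sketch of how one small CSV could map to text Ion under each interpretation (the exact output shape, including how the corner cell and row labels are treated, is an assumption for illustration):

```
CSV input:
    ,a,b
    x,1,2
    y,3,4

matrix:             [[null,"a","b"],["x",1,2],["y",3,4]]
array of records:   [{a:1,b:2},{a:3,b:4}]          // inner keys from the first row
record of arrays:   {x:[1,2],y:[3,4]}              // outer keys from the first column
record of records:  {x:{a:1,b:2},y:{a:3,b:4}}      // both
```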
We can define the CSV algebraic as follows:
module mir.algebraic_alias.csv;
/++
Definition union for $(LREF CsvAlgebraic).
+/
import mir.algebraic;
/++
CSV tagged algebraic alias.
+/
alias CsvAlgebraic = Algebraic!Csv_;
union Csv_
{
/// Used for empty CSV scalar like one between two separators: `,,`
typeof(null) null_;
/// Used for false, true, False, True, and friends. Follows YAML conversion
bool boolean;
///
long integer;
///
double float_;
///
immutable(char)[] string;
}
Added draft: https://github.com/libmir/mir-ion/blob/master/source/mir/csv.d
I don't know if this applies or not, but one thing that is very useful in R's read.csv [1] function is the ability to identify certain strings as representing NAs. For instance, setting it so that "NA" or "N/A" or "#N/A" (you could even do -999) are recognised as NAs means the entire column will be the floating point type even if there were originally some strings in it. It makes it so you don't need to read the column as a string and process it later.
Not sure what other features would be useful, but that's one that sticks out.
[1] https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/read.table
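For comparison, numpy's genfromtxt already exposes this behaviour through its missing_values/filling_values parameters; a minimal self-contained sketch:

```python
import io
import numpy as np

# Treat "NA" cells as missing and fill them with NaN, so the whole
# column stays floating point instead of falling back to strings.
csv_text = "1,2\n3,NA\n"
matrix = np.genfromtxt(io.StringIO(csv_text), delimiter=",",
                       missing_values="NA", filling_values=np.nan)
print(matrix)  # the "NA" cell comes back as nan
```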
Also check numpy.genfromtxt() and pandas.read_csv().
I think handling invalid/NA values (including empty entries, i.e. ,,) and allowing the user to pass in column converter callbacks are the two most important features.
numpy.genfromtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None, skip_header=0, skip_footer=0, converters=None, missing_values=None, filling_values=None, usecols=None, names=None, excludelist=None, deletechars=" !#$%&'()*+, -./:;<=>?@[\]^{|}~", replace_space='_', autostrip=False, case_sensitive=True, defaultfmt='f%i', unpack=None, usemask=False, loose=True, invalid_raise=True, max_rows=None, encoding='bytes', *, ndmin=0, like=None)
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
column converter callbacks
E.g. user can plug in his own date string to int/float converter function.
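As a concrete illustration (in numpy, where converters is a dict keyed by column index; the in-memory CSV and the epoch-seconds conversion are just for the example):

```python
import io
import numpy as np
from datetime import datetime, timezone

# Hypothetical converter: parse the date column into epoch seconds (a float)
# at import time so the resulting array stays homogeneous.
def to_epoch(s):
    return (datetime.strptime(s, "%Y-%m-%d %H:%M:%S")
            .replace(tzinfo=timezone.utc).timestamp())

csv_text = ("2021-01-21 09:30:00,133.8\n"
            "2021-01-21 09:35:00,134.25\n")
matrix = np.genfromtxt(io.StringIO(csv_text), delimiter=",",
                       converters={0: to_epoch}, encoding="utf-8")
# Both columns are now floats; the two rows are 5 minutes (300 s) apart.
```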
@mw66 I can't see a reason why we would even need to provide an in-column converter callback while we have the power of mir.algebraic, mir.ndslice, and mir.functional. It may be confusing, but the experience of people coming from scripting languages and mine are so different that it is hard to understand why they want to do something one way while there is a 'common' way to do it.
Maybe we could share our experiences with each other.
Let's do the following. Please provide a CSV/TSV data sample and an example in any programming language of how you handle it. Then I will provide an example of how we can handle the data in Mir. When we do so it would be easy to figure out a good API.
@mw66 I have updated the draft with callback support. Please check the first unittest.
why we may even need to provide an in-column converter callback
The reason is simple: the ndslice's type is homogeneous, it's just double/float, but the input can be mixed with date string: e.g.
Date,Open,High,Low,Close,Volume
2021-01-21 09:30:00,133.8,134.43,133.59,134.0,9166695
2021-01-21 09:35:00,134.25,135.0,134.19,134.5,4632863
You need the converter at import time, to convert the date string into double, not afterwards (how do you store string in the ndslice mixed with floats then)?
https://github.com/libmir/mir-ion/blob/master/source/mir/csv.d#L518
So, ndslice can have mixed types for different columns? (sorry if I miss something, I'm a newbie to mir).
So, ndslice can have mixed types for different columns? (sorry if I miss something, I'm a newbie to mir).
No, ndslice can't. However, you aren't limited to having an ndslice matrix of double. Five other options in Mir can do it:
- WIP: Series!(Timestamp*, double*, 2) support: mir.series (mir-algorithm) can store an index of one type and an ndslice data matrix of another type. mir-csv will be able to load the index from the first column.
- Implemented: Slice!(CsvAlgebraic*, 2) support: mir.algebraic.Algebraic (mir-core), including mir.algebraic_alias.csv (mir-ion), can store values of different types. So, you could load an ndslice matrix of algebraic types and then process it.
- Implemented: Tuple!(Timestamp[], double[], double[], double[], double[], double[]): mir.functional.Tuple (mir-core, latest release), which is used in the example, is a kind of static array of different predefined types.
- A struct of arrays:
import mir.timestamp: Timestamp;
import mir.serde: serdeKeys;
struct Data
{
    @serdeKeys("Date")
    Timestamp[] date;
    @serdeKeys("Open")
    double[] open;
    ...
}
- Associative arrays of columns: CsvAlgebraic[][string]
And the callback for column conversion is implemented as you want.
I would want to do something like this:
auto text = "1,2\n3,4\n5,#N/A\n";
auto matrix = //TODO
matrix.should == [[1.0, 2], [3.0, 4], [5.0, double.nan]];
I saw something with nulls in your UT, but I didn't think you could mix nulls with other data in an array.
With respect to column converters/classes, I think that is just a scripting language way to try to enforce types. It's telling the function how to process the text from the csv. For a scripting language it is good to have it when you need it, but if you have a type system then usually you would just tell it that it is a particular type (provided that type is supported).
I think one difficulty is that this involves a lot of different functionality that don't have the best documentation or examples. Even if it is incredibly powerful, more work on that front may help reduce the burden on future users.
I saw something with nulls in your UT, but I didn't think you could mix nulls with other data in an array.
mir.ion.conv.serde!(mir.algebraic.Nullable!double[][])(text.Csv)
I saw something with nulls in your UT, but I didn't think you could mix nulls with other data in an array. mir.ion.conv.serde!(mir.algebraic.Nullable!double[][])(text.Csv)
@jmh530 The full example
/// Converting NA to NaN
unittest
{
import mir.csv;
import mir.algebraic: Nullable, visit;
import mir.ion.conv: serde;
import mir.ndslice: Slice, map, slice;
import mir.ser.text: serializeText;
import mir.test: should;
auto text = "1,2\n3,4\n5,#N/A\n";
auto matrix = text
.Csv
.serde!(Slice!(Nullable!double*, 2))
.map!(visit!((double x) => x, (_) => double.nan))
.slice;
matrix.serializeText.should == q{[[1.0,2.0],[3.0,4.0],[5.0,nan]]};
}
why we may even need to provide an in-column converter callback
The reason is simple: the ndslice's type is homogeneous, it's just double/float, but the input can be mixed with date string: e.g.
Date,Open,High,Low,Close,Volume
2021-01-21 09:30:00,133.8,134.43,133.59,134.0,9166695
2021-01-21 09:35:00,134.25,135.0,134.19,134.5,4632863
You need the converter at import time, to convert the date string into double, not afterwards (how do you store string in the ndslice mixed with floats then)?
BTW, here is the Python code, the converter is a dictionary keyed by column index/name:
import numpy as np
from datetime import datetime
str2date = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
data = np.genfromtxt('data.csv',dtype=None,names=True, delimiter=',', converters = {0: str2date})
For a scripting language it is good to have it when you need it, but if you have a type system then usually you would just tell it that it is a particular type (provided that type is supported).
Without converters in D, then how do you handle the date string in my above example? (The other columns are floats)
Without converters in D, then how do you handle the date string in my above example?
mir.csv recognises it as a timestamp. It does this for all input fields: it parses numbers, booleans, and timestamps.
Without converters in D, then how do you handle the date string in my above example?
mir.csv recognises it as a timestamp. It does this for all input fields: it parses numbers, booleans, and timestamps.
Then it's in fixed format, but in real life data you may encounter all kinds of formats e.g:
2022/09/24
2022.09.24
2022-09-24
...
24/09/2022
...
24-sep-2022
24-sept-2022
...
09/24/2022
09 24 2022
09/24/22
...
September/24/2022
...
The list can go on and on.
You'd better allow the user to plug in their own converters for their own data.
And this is just for dates; the data may contain all kinds of different strings that the user wants to convert to numbers in their own way.
You'd better allow the user to plug in their own converters for their own data.
It is allowed now. Please check the conversionFinalizer
.
You'd better allow the user to plug in their own converters for their own data.
It is allowed now. Please check the
conversionFinalizer
.
conversionFinalizer: (unquotedString, scalar, columnIndex, columnName)
So you pass the columnIndex and columnName out to let the user branch in their function to dispatch to different columns? This looks ugly and may incur code duplication, e.g. for two different CSVs with col-1 and col-2 swapped:
In Python, passing in a dict, it's two one-liners:
data1 = np.genfromtxt("data1.csv", ..., converters = {1: cvtr1, 2: cvtr2})
data2 = np.genfromtxt("data2.csv", ..., converters = {2: cvtr1, 1: cvtr2})
With conversionFinalizer in D:
conversionFinalizer1 (...) {
    switch (columnIndex) {
        case 1: return cvtr1(unquotedString);
        case 2: return cvtr2(unquotedString);
        default: assert(0);
    }
}
conversionFinalizer2 (...) {
    switch (columnIndex) {
        case 1: return cvtr2(unquotedString);
        case 2: return cvtr1(unquotedString);
        default: assert(0);
    }
}
too verbose.
Why not use the Python dictionary format, and let Mir do such branching in the library?
It is less verbose in Python because it is a scripting language. If you think it will be less verbose, please give an example in D. But this should be a full-featured solution like the current one.
It has nothing to do with Python being a scripting language. It's about the API function interface, and who (the lib or the user) is responsible for dispatching the column converters.
Example in D can just follow the Python api:
double cvtr1(string str) {return ...;}
double cvtr2(string str) {return ...;}
data1 = mir.genfromtxt("data1.csv", ..., [1: cvtr1, 2: cvtr2]); // D does not have named arg yet, let just use positional arg
data2 = mir.genfromtxt("data2.csv", ..., [2: cvtr1, 1: cvtr2]); // pass in D's AA.
So why this code cannot be implemented in the D library?
It can be added. But that isn't a generalised solution. We could do an additional overload like that. Note that this can be just a wrapper around the verbose solution.
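That wrapper is easy to sketch. Here is the dispatch logic in Python with made-up names (in D, the same fold of a per-column dictionary into a single finalizer-style callback would live inside the library overload):

```python
def make_finalizer(converters):
    """Fold a {column_index: converter} dict into a single callback
    with a finalizer-style (value, column_index) shape."""
    def finalizer(unquoted_string, column_index):
        conv = converters.get(column_index)
        return conv(unquoted_string) if conv else unquoted_string
    return finalizer

# Two files with the converter columns swapped need only two dict literals.
cvtr1 = float
cvtr2 = lambda s: int(s, 16)  # e.g. parse a hexadecimal column

f1 = make_finalizer({0: cvtr1, 1: cvtr2})
f2 = make_finalizer({1: cvtr1, 0: cvtr2})

print(f1("1.5", 0), f1("ff", 1))  # 1.5 255
```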
- I don't understand what you mean by "that isn't a generalised solution", can you elaborate? I think passing the dictionary is as general as the current conversionFinalizer.
- Please do the overload. As long as the dispatching is inside the library code, the user's calling code will be tidy and succinct.
- I don't understand what you mean by "that isn't a generalised solution", can you elaborate? I think passing the dictionary is as general as the current conversionFinalizer.
conversionFinalizer
provides much more context for the user.