[C++] CSV reader: Ability to not infer column types.
I'm trying to read CSV as is. All columns as strings. I don't know the schema of these CSVs and they will vary as they are provided by user.
Right now I'm using pandas.read_csv(dtype=str), which works great, but since the final destination of these CSVs is Parquet files, it seems much more efficient to use pyarrow.csv.read_csv in the future, as soon as this becomes available :)
I tried things like pyarrow.csv.read_csv(convert_types=ConvertOptions(columns_types=defaultdict(lambda: 'string'))) but it doesn't work.
Maybe I just didn't find something that already exists? :)
Environment: Ubuntu Xenial
Reporter: Bogdan Klichuk
Note: This issue was originally created as ARROW-5811. Please see the migration documentation for further details.
Antoine Pitrou / @pitrou:
No, column_types must be the full mapping of column names to data types. C++ doesn't know about defaultdict...
We could add more inference options, though, for example to select the datatypes for which inference is enabled.
Antoine Pitrou / @pitrou:
@wesm @nealrichardson do you have an idea about a desirable API here?
Wes McKinney / @wesm:
I think we need to create an abstract C++ type (or similar) that is a ConversionRule. We have other types of conversion rules where we have not defined an API yet, for example "timestamp with strptime-like format of $FORMAT". Whatever API we have, it needs to be extensible to accommodate new kinds of logic.
Neal Richardson / @nealrichardson:
I think I'm not understanding the problem. What's missing from the column_types we already support? https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/options.h#L69
Antoine Pitrou / @pitrou:
The request is for no inference to occur, without knowing the column names or the number of columns in advance (so you cannot pass an explicit column_types).
Neal Richardson / @nealrichardson:
In principle, a user could parse the header row of the CSV separately to identify the column names, then use that to define column_types mapping each name to string type. So are we just talking about how to facilitate that, whether/how to internalize that logic and expose it as a simple argument? Or is there something else?
If column_types didn't have to be a map, maybe that would help. Perhaps it could also accept an array of length equal to the number of columns, or a single value, in which case it would recycle that type for every column.
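The header-first workaround described above can be sketched in Python roughly as follows. This is only an illustration, not an API proposed in this thread: the helper names `read_header_names` and `read_csv_as_strings` are hypothetical, and the sketch assumes a comma-delimited, UTF-8 file whose first row is the header and whose column names are unique.

```python
import csv


def read_header_names(path):
    """Return the column names from the first row of a CSV file.

    Hypothetical helper; assumes a comma delimiter and UTF-8 encoding.
    """
    with open(path, newline="", encoding="utf-8") as f:
        # An empty file yields an empty list instead of raising StopIteration.
        return next(csv.reader(f), [])


def read_csv_as_strings(path):
    """Read a CSV with pyarrow, forcing every column to the string type.

    Hypothetical helper; duplicate column names would collapse in the dict.
    """
    # Imported here so the header helper stays usable without pyarrow.
    import pyarrow as pa
    import pyarrow.csv as pacsv

    # Build the explicit name -> type mapping that column_types requires.
    column_types = {name: pa.string() for name in read_header_names(path)}
    options = pacsv.ConvertOptions(column_types=column_types)
    return pacsv.read_csv(path, convert_options=options)
```

The cost of this approach is that the first row is read twice, and any non-default parsing settings (delimiter, quoting, skipped rows) would have to be kept in sync between the two readers by hand.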
Antoine Pitrou / @pitrou:
We're talking about C++ here. Dynamic typing isn't terribly idiomatic (though it's possible using std::variant) :-)
Wes McKinney / @wesm:
Yeah, so we could define a conversion rule to return string or binary, and then add an option to set a default conversion rule (where currently we have an implicit default of "use type inference").
This issue hasn't had activity in a long time. If it's still being worked on, please leave a comment. Otherwise, it will be closed on 23rd June.
Labelled Status: Stale-Warning for tracking.
There is a PR up that suggests adding a default_column_type option to ConvertOptions.
See: https://github.com/apache/arrow/pull/47663/files. Is there any opinion on the state of that PR?
I feel it will need more work (assuming the contributor is responsive, as I suspect it was generated with AI), but I would like to see the feature included.
cc @pitrou