[C++] CSV reader: Ability to not infer column types.

Open asfimport opened this issue 6 years ago • 9 comments

I'm trying to read CSVs as-is, with all columns as strings. I don't know the schema of these CSVs, and they will vary because they are provided by users.

Right now I'm using pandas.read_csv(dtype=str), which works great, but since the final destination of these CSVs is Parquet files, it seems much more efficient to use pyarrow.csv.read_csv in the future, as soon as this becomes available :)

I tried things like pyarrow.csv.read_csv(convert_types=ConvertOptions(columns_types=defaultdict(lambda: 'string'))) but it doesn't work.

Maybe I just didn't find something that already exists? :)

Environment: Ubuntu Xenial

Reporter: Bogdan Klichuk

Note: This issue was originally created as ARROW-5811. Please see the migration documentation for further details.

asfimport avatar Jun 30 '19 21:06 asfimport

Antoine Pitrou / @pitrou: No, column_types must be the full mapping of column names to data types. C++ doesn't know about defaultdict...

We could add more inference options, though, for example to select the datatypes for which inference is enabled.

asfimport avatar Jul 01 '19 08:07 asfimport

Antoine Pitrou / @pitrou: @wesm @nealrichardson do you have an idea about a desirable API here?

asfimport avatar Jul 17 '19 15:07 asfimport

Wes McKinney / @wesm: I think we need to create an abstract C++ type (or similar) that is a ConversionRule. We have other types of conversion rules where we have not defined an API yet, for example "timestamp with strptime-like format of $FORMAT". Whatever API we have, it needs to be extensible to accommodate new kinds of logic.

asfimport avatar Jul 17 '19 15:07 asfimport

Neal Richardson / @nealrichardson: I think I'm not understanding the problem. What's missing from the column_types we already support? https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/options.h#L69

asfimport avatar Jul 17 '19 15:07 asfimport

Antoine Pitrou / @pitrou: The request is for no inference to occur, without knowing the column names or the number of columns in advance (so you cannot pass an explicit column_types).

asfimport avatar Jul 17 '19 15:07 asfimport

Neal Richardson / @nealrichardson: In principle, a user could parse the header row of the CSV separately to identify the column names, then use that to define column_types mapping each name to string type. So are we just talking about how to facilitate that, whether/how to internalize that logic and expose it as a simple argument? Or is there something else?

If column_types didn't have to be a map, maybe that would help. Perhaps it could also accept an array of length equal to the number of columns, or a single value, in which case it would recycle that type for every column. 
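The header-parsing workaround described above can be sketched in Python as follows. This is a minimal sketch, not part of the Arrow API: read_header, all_string_column_types and read_csv_as_strings are hypothetical helper names, and it assumes a UTF-8 file whose first row is the header and whose column names are unique (duplicate names would collapse in the dict). The header is parsed with the standard library csv module, which may not match Arrow's own parsing for unusual quoting or delimiters.

```python
import csv
import io


def read_header(source, delimiter=","):
    """Return the column names from the first row of a CSV text stream."""
    reader = csv.reader(source, delimiter=delimiter)
    return next(reader, [])


def all_string_column_types(names, type_factory):
    """Map every column name to the same type, built by `type_factory`.

    This "recycles" a single type for every column, as suggested above.
    """
    return {name: type_factory() for name in names}


def read_csv_as_strings(path):
    """Read `path` with pyarrow, forcing every column to string (no inference)."""
    # Imported lazily so the helpers above work without pyarrow installed.
    import pyarrow as pa
    from pyarrow import csv as pa_csv

    # First pass: read only the header row to learn the column names.
    with open(path, newline="", encoding="utf-8") as f:
        names = read_header(f)

    # Second pass: hand pyarrow an explicit name -> string mapping.
    column_types = all_string_column_types(names, pa.string)
    options = pa_csv.ConvertOptions(column_types=column_types)
    return pa_csv.read_csv(path, convert_options=options)


# Demonstration of the mapping step only, using a plain placeholder
# instead of a real Arrow type so that it runs without pyarrow.
sample = io.StringIO("id,name,joined\n1,Ada,2019-06-30\n")
print(all_string_column_types(read_header(sample), lambda: "string"))
```

The cost of this approach is that the header is read twice and the caller has to keep the two parsers' settings (delimiter, quoting) in sync, which is the part a built-in option would remove.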

 

asfimport avatar Jul 17 '19 15:07 asfimport

Antoine Pitrou / @pitrou: We're talking about C++ here. Dynamic typing isn't terribly idiomatic (though it's possible using std::variant) :-)

asfimport avatar Jul 17 '19 15:07 asfimport

Wes McKinney / @wesm: Yeah, so we could define a conversion rule to return string or binary, and then add an option to set a default conversion rule (where currently we have an implicit default of "use type inference")
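To make the "default conversion rule" idea concrete, here is a toy sketch in Python rather than C++. Nothing in it is Arrow API: HypotheticalConvertOptions, as_string, infer_int_or_string and convert are invented names, and the "inference" is a deliberately crude stand-in (integer or string only). It only illustrates how per-column rules could fall back to a configurable default instead of an implicit "use type inference".

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical: a conversion rule turns a column's raw cells into values.
ConversionRule = Callable[[List[str]], list]


def as_string(cells):
    """Rule that performs no inference: cells stay as strings."""
    return list(cells)


def infer_int_or_string(cells):
    """Toy stand-in for inference: ints if every cell parses, else strings."""
    try:
        return [int(c) for c in cells]
    except ValueError:
        return list(cells)


@dataclass
class HypotheticalConvertOptions:
    # Explicit per-column rules, analogous to column_types.
    column_rules: Dict[str, ConversionRule] = field(default_factory=dict)
    # Rule applied to every column without an explicit entry.
    default_rule: ConversionRule = infer_int_or_string

    def rule_for(self, name):
        return self.column_rules.get(name, self.default_rule)


def convert(columns, options):
    """Apply the selected rule to each column of raw string cells."""
    return {name: options.rule_for(name)(cells) for name, cells in columns.items()}


raw = {"id": ["1", "2"], "zip": ["02139", "10001"]}

# Default behaviour: inference runs everywhere and drops the leading zero.
inferred = convert(raw, HypotheticalConvertOptions())

# Requested behaviour: a string default, with no column names known in advance.
untouched = convert(raw, HypotheticalConvertOptions(default_rule=as_string))
```

With a string default, the caller no longer needs to know the column names or their count, which is the situation described in the original report.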

asfimport avatar Jul 17 '19 15:07 asfimport

This issue hasn't had activity in a long time. If it's still being worked on, please leave a comment. Otherwise, it will be closed on 23rd June.

Labelled Status: Stale-Warning for tracking.

thisisnic avatar Jun 21 '25 08:06 thisisnic

There is a PR up that suggests adding a default_column_type option to ConvertOptions. See: https://github.com/apache/arrow/pull/47663/files. Is there any opinion on the state of that PR?

I feel it will need more work (assuming the contributor is responsive; I suspect it was generated with AI), but I would like to see the feature included.

cc @pitrou

AlenkaF avatar Oct 24 '25 09:10 AlenkaF