
feat: add support to specify data types for columns in command line

Open Mottl opened this issue 1 year ago • 7 comments

Hello! This PR will allow setting column data types as command line arguments:

Examples:

# Use Float32 for all float columns and Int32 for all integer columns:
csv2parquet --i32='*' --f32='*' in.csv out.parquet

# Use Float32 for all float columns and Int32 for all integer columns,
# but Int64 for `foo' and `bar', and Float64 for `baz':
csv2parquet --i32='*' --f32='*' --i64=foo,bar --f64=baz in.csv out.parquet

# `__all__' can be used interchangeably with `*':
csv2parquet --i32=__all__ --f32=__all__ in.csv out.parquet 

The command line arguments --i32, --i64, --f32 and --f64 override the data types provided with --schema-file.
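To make the intended precedence concrete, here is a minimal Python sketch of how the flags above could be resolved per column. It is not the code from this PR: the function name `resolve_types`, the dictionary shapes, and the Int64/Float64 defaults are assumptions made for illustration. It also assumes that a wildcard only touches columns whose inferred kind matches the flag, and that a column named explicitly wins over a wildcard.

```python
# Hypothetical sketch of the flag precedence described above; not the PR's code.
FLAGS = {
    "i32": ("int", "Int32"),
    "i64": ("int", "Int64"),
    "f32": ("float", "Float32"),
    "f64": ("float", "Float64"),
}
WILDCARDS = {"*", "__all__"}
# Assumed defaults when no flag or schema entry applies.
DEFAULTS = {"int": "Int64", "float": "Float64"}


def resolve_types(inferred_kinds, flags, schema=None):
    """Return {column: type name}.

    inferred_kinds: {column: "int" | "float" | other kind} from CSV inference.
    flags: {"i32": ["*"], "i64": ["foo", "bar"], ...} parsed from the CLI.
    schema: optional {column: type name} loaded from --schema-file.
    """
    schema = schema or {}
    resolved = {}
    for column, kind in inferred_kinds.items():
        # Lowest priority: schema file entry, else the default for the kind.
        dtype = schema.get(column, DEFAULTS.get(kind, kind))
        # Next: a wildcard flag, applied only to columns of the matching kind.
        for flag, (flag_kind, flag_type) in FLAGS.items():
            if flag_kind == kind and WILDCARDS & set(flags.get(flag, [])):
                dtype = flag_type
        # Highest priority: the column is named explicitly in a flag.
        for flag, (_, flag_type) in FLAGS.items():
            if column in flags.get(flag, []):
                dtype = flag_type
        resolved[column] = dtype
    return resolved
```

Under these assumptions, the second example resolves `foo` and `bar` to Int64, `baz` to Float64, and every other integer or float column to its 32-bit type, while non-numeric columns are left alone.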

Mottl avatar Feb 17 '25 15:02 Mottl

Dominik, from my point of view, schema-file is decent when you have a persistent set of columns.

My case is different: many columns are added and removed on every new run, and it's "expensive" for me to maintain schema.json.
Another point is about the default data type for floats (and integers): the majority of ML stuff (both NNs and GBDTs) suggests Float32 for training due to performance and memory reasons. So it's desirable to have a switch to set all the floats (and, why not, integers) to 32-bit precision, leaving a couple (or none) of them at 64 bits. That's why the logic feels "inverted" 😀

Mottl avatar Feb 17 '25 17:02 Mottl

I see. In that case I'd suggest supporting partial schemas for flexibility, plus an option to allow or disallow certain types during type inference. Would that address your use case?

I'm thinking about how we can keep the API surface from being too tied to specific cases, which would increase my maintenance burden.

domoritz avatar Feb 17 '25 19:02 domoritz

It's fine, I understand your point. Supporting schema.json files in my case is not a good option.

Mottl avatar Feb 18 '25 03:02 Mottl

Yeah, the schema is too cumbersome, but what about my proposal to allow only certain types in schema inference? Maybe --disallow_types=float64,int64. Ideally that would be something supported upstream.
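As a rough illustration of what such an option might mean, the Python sketch below narrows inferred types that appear in a disallow list. Nothing here exists in csv2parquet or upstream: the function `apply_disallowed`, the `FALLBACKS` table, and the choice to fall back from a 64-bit type to its 32-bit counterpart are all assumptions.

```python
# Hypothetical sketch of a --disallow_types option; names and fallbacks are assumed.
FALLBACKS = {"Float64": "Float32", "Int64": "Int32"}


def apply_disallowed(inferred, disallowed):
    """Replace each disallowed inferred type with its assumed narrower fallback.

    inferred: {column: type name} as produced by type inference.
    disallowed: type names from the flag, e.g. ["float64", "int64"].
    """
    blocked = {name.lower() for name in disallowed}
    result = {}
    for column, dtype in inferred.items():
        # Keep narrowing until the type is allowed or no fallback is left.
        while dtype.lower() in blocked:
            if dtype not in FALLBACKS:
                raise ValueError(f"no allowed type left for column {column!r}")
            dtype = FALLBACKS[dtype]
        result[column] = dtype
    return result
```

A blanket rule like this applies to every column of the affected type, so on its own it has no way to keep a few selected columns at 64 bits.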

domoritz avatar Feb 18 '25 04:02 domoritz

Uhm... the point is not disallowing types; the point is to have a simple way to set a default bit size for floats and integers, like setting all the float columns to float32 and only some of them to float64 (or vice versa).

Mottl avatar Feb 18 '25 05:02 Mottl

I see. I can relate to the use case, but the API with --i32='*' feels like it could be misunderstood. It reads as if it should make all columns i32, which is not the case.

Also note that so far I have not added features that didn't have an implementation upstream (to reduce maintenance burden on me). So I'm hesitant to merge this.

I appreciate the effort you put into the pull request but I hope you understand my concerns.

domoritz avatar Feb 18 '25 05:02 domoritz

yes, sure, no problem

Mottl avatar Feb 18 '25 05:02 Mottl