arrow [Python] read_csv converts strings with leading zeros to integers

Describe the bug, including details regarding any error messages, version, and platform.

If a CSV file contains quoted values with leading zeros, pyarrow strips the leading zeros and converts the values to integers. For example:

In [15]: with io.BytesIO(b'col1,col2\n"001","foo"\n"002","bar"') as buf:
    ...:     tbl = pa.csv.read_csv(buf, parse_options=pa.csv.ParseOptions(quote_char='"'))
    ...:     print(tbl)
pyarrow.Table
col1: int64
col2: string
----
col1: [[1,2]]
col2: [["foo","bar"]]

Component(s)

Python

Jun 18 '25 14:06 WillAyd

Correct, this is expected behavior; see the type inference docs. You can set a column's type explicitly, which disables type inference for that column, with column_types in pa.csv.ConvertOptions:

In [45]: convert_options = pa.csv.ConvertOptions(
    ...:     column_types={
    ...:         'col1': pa.string(),
    ...:     }
    ...: )
    ...: with io.BytesIO(b'col1,col2\n"001","foo"\n"002","bar"') as buf:
    ...:     tbl = pa.csv.read_csv(buf, parse_options=pa.csv.ParseOptions(quote_char='"'),
    ...:                           convert_options=convert_options)
    ...:     print(tbl)
    ...: 
pyarrow.Table
col1: string
col2: string
----
col1: [["001","002"]]
col2: [["foo","bar"]]

Jun 23 '25 08:06 AlenkaF

I have encountered this default behavior when parsing product IDs or telephone numbers that contain leading zeros.

I propose emitting a warning after ingesting a CSV file that contains columns with leading zeros, indicating that the user should use pa.string() if they wish to preserve the leading zeros.
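
Until something like that exists, a similar check can be done on the user side before handing the file to pyarrow. The following is only a rough sketch using the standard library csv module, not anything pyarrow provides: the helper name columns_with_leading_zeros and the rule "more than one ASCII digit, starting with 0" are assumptions made for illustration.

```python
import csv
import io


def columns_with_leading_zeros(text):
    """Return names of columns where any value looks like a zero-padded integer.

    Hypothetical helper for illustration; it is not part of pyarrow.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    flagged = set()
    for row in reader:
        for name, value in zip(header, row):
            # "007" is flagged; "0", "0.5" and "abc" are not.
            if (
                len(value) > 1
                and value.startswith("0")
                and value.isascii()
                and value.isdigit()
            ):
                flagged.add(name)
    # Preserve the header order in the result.
    return [name for name in header if name in flagged]


sample = 'col1,col2\n"001","foo"\n"002","bar"\n'
print(columns_with_leading_zeros(sample))
```

The returned names could then be mapped to pa.string() in the column_types dictionary passed to pa.csv.ConvertOptions, as in the reply above. Note that this scans the whole input once before pyarrow reads it, which may be too slow for very large files.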

Jun 30 '25 14:06 ahmedsalah15

I’m not sure I’d be in favour of adding a warning, as it might be annoying or noisy in general use cases. However, I definitely agree that this is a good case to document: it would fit well in the Python User Guide and the Python Cookbook. Contributions are most welcome!

Jul 02 '25 12:07 AlenkaF