[Python] read_csv converts strings with leading zeros to integers
Describe the bug, including details regarding any error messages, version, and platform.
If a CSV file contains quoted values with leading zeros, pyarrow strips the leading zeros and converts the values to integers. For example:
In [14]: import io
    ...: import pyarrow as pa
    ...: import pyarrow.csv

In [15]: with io.BytesIO(b'col1,col2\n"001","foo"\n"002","bar"') as buf:
    ...:     tbl = pa.csv.read_csv(buf, parse_options=pa.csv.ParseOptions(quote_char='"'))
    ...:     print(tbl)
pyarrow.Table
col1: int64
col2: string
----
col1: [[1,2]]
col2: [["foo","bar"]]
Component(s)
Python
Correct, this is expected behavior; see the type inference docs. Quoting a value does not exempt it from type inference. You can set a column's type explicitly, which disables type inference for that column, with column_types in pa.csv.ConvertOptions:
In [44]: from pyarrow import csv

In [45]: convert_options = csv.ConvertOptions(
    ...:     column_types={
    ...:         'col1': pa.string(),
    ...:     }
    ...: )
    ...: with io.BytesIO(b'col1,col2\n"001","foo"\n"002","bar"') as buf:
    ...:     tbl = csv.read_csv(buf, parse_options=csv.ParseOptions(quote_char='"'),
    ...:                        convert_options=convert_options)
    ...:     print(tbl)
    ...:
pyarrow.Table
col1: string
col2: string
----
col1: [["001","002"]]
col2: [["foo","bar"]]
I have encountered this default behavior when parsing product IDs or telephone numbers that contain leading zeros.
I propose adding a warning after ingesting a CSV file that contains columns with leading zeros, indicating that the user should use pa.string() to preserve them.
I’m not sure I’d be in favour of adding a warning, as it might be annoying or noisy in general use cases. However, I definitely agree that this is a good case to document — it would fit well in the Python User Guide and the Python Cookbook. Contributions are most welcome!