
BUG: pd.read_csv error when parse_dates used on index_col for dtype_backend="pyarrow"

Open MCRE-BE opened this issue 1 year ago • 12 comments

Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import datetime

import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [datetime.datetime.now()]})
df.to_csv('test.csv', index=False)


# Works: pyarrow engine together with the pyarrow dtype backend.
a = pd.read_csv(
    'test.csv',
    parse_dates=['b'],
    index_col="b",
    dtype_backend="pyarrow",
    engine="pyarrow",
)

# Works: default engine and default dtype backend.
pd.read_csv(
    'test.csv',
    parse_dates=['b'],
    index_col="b",
)

The following fails:

pd.read_csv(
    'test.csv',
    parse_dates=['b'],
    index_col="b",
    dtype_backend="pyarrow",
)

Issue Description

The last call raises ValueError: not all elements from date_cols are numpy arrays

ValueError                                Traceback (most recent call last)
Cell In[78], line 1
----> 1 pd.read_csv(
      2     'test.csv',
      3     parse_dates=['b'],
      4     index_col="b",
      5     dtype_backend="pyarrow",
      6 )

File c:\ProgramData\miniconda3\envs\DLL_Forecast\Lib\site-packages\pandas\io\parsers\readers.py:948, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
    935 kwds_defaults = _refine_defaults_read(
    936     dialect,
    937     delimiter,
   (...)
    944     dtype_backend=dtype_backend,
    945 )
    946 kwds.update(kwds_defaults)
--> 948 return _read(filepath_or_buffer, kwds)

File c:\ProgramData\miniconda3\envs\DLL_Forecast\Lib\site-packages\pandas\io\parsers\readers.py:617, in _read(filepath_or_buffer, kwds)
    614     return parser
    616 with parser:
--> 617     return parser.read(nrows)

File c:\ProgramData\miniconda3\envs\DLL_Forecast\Lib\site-packages\pandas\io\parsers\readers.py:1748, in TextFileReader.read(self, nrows)
   1741 nrows = validate_integer("nrows", nrows)
   1742 try:
   1743     # error: "ParserBase" has no attribute "read"
   1744     (
   1745         index,
   1746         columns,
   1747         col_dict,
-> 1748     ) = self._engine.read(  # type: ignore[attr-defined]
   1749         nrows
   1750     )
   1751 except Exception:
   1752     self.close()

File c:\ProgramData\miniconda3\envs\DLL_Forecast\Lib\site-packages\pandas\io\parsers\c_parser_wrapper.py:333, in CParserWrapper.read(self, nrows)
    330     data = {k: v for k, (i, v) in zip(names, data_tups)}
    332 names, date_data = self._do_date_conversions(names, data)
--> 333 index, column_names = self._make_index(date_data, alldata, names)
    335 return index, column_names, date_data

File c:\ProgramData\miniconda3\envs\DLL_Forecast\Lib\site-packages\pandas\io\parsers\base_parser.py:371, in ParserBase._make_index(self, data, alldata, columns, indexnamerow)
    369 elif not self._has_complex_date_col:
    370     simple_index = self._get_simple_index(alldata, columns)
--> 371     index = self._agg_index(simple_index)
    372 elif self._has_complex_date_col:
    373     if not self._name_processed:

File c:\ProgramData\miniconda3\envs\DLL_Forecast\Lib\site-packages\pandas\io\parsers\base_parser.py:468, in ParserBase._agg_index(self, index, try_parse_dates)
    466 for i, arr in enumerate(index):
    467     if try_parse_dates and self._should_parse_dates(i):
--> 468         arr = self._date_conv(
    469             arr,
    470             col=self.index_names[i] if self.index_names is not None else None,
    471         )
    473     if self.na_filter:
    474         col_na_values = self.na_values

File c:\ProgramData\miniconda3\envs\DLL_Forecast\Lib\site-packages\pandas\io\parsers\base_parser.py:1142, in _make_date_converter.<locals>.converter(col, *date_cols)
   1139     return date_cols[0]
   1141 if date_parser is lib.no_default:
-> 1142     strs = parsing.concat_date_cols(date_cols)
   1143     date_fmt = (
   1144         date_format.get(col) if isinstance(date_format, dict) else date_format
   1145     )
   1147 with warnings.catch_warnings():

File parsing.pyx:1155, in pandas._libs.tslibs.parsing.concat_date_cols()

ValueError: not all elements from date_cols are numpy arrays

Expected Behavior

As with dtype_backend="numpy_nullable" the index column should be parsed correctly.

It works when engine="pyarrow" is also passed (the first call in the example above), so this is likely an issue with the C parser (?)

Installed Versions

INSTALLED VERSIONS

commit               : a671b5a8bf5dd13fb19f0e88edc679bc9e15c673
python               : 3.11.8.final.0
python-bits          : 64
OS                   : Windows
OS-release           : 10
Version              : 10.0.19042
machine              : AMD64
processor            : AMD64 Family 23 Model 24 Stepping 1, AuthenticAMD
byteorder            : little
LC_ALL               : None
LANG                 : None
LOCALE               : Dutch_Belgium.1252

pandas               : 2.1.4
numpy                : 1.26.4
pytz                 : 2024.1
dateutil             : 2.9.0
setuptools           : 69.2.0
pip                  : 24.0
Cython               : None
pytest               : 8.1.1
hypothesis           : None
sphinx               : 7.2.6
blosc                : None
feather              : None
xlsxwriter           : 3.1.9
lxml.etree           : 5.1.0
html5lib             : 1.1
pymysql              : None
psycopg2             : None
jinja2               : 3.1.3
IPython              : 8.22.2
pandas_datareader    : None
bs4                  : 4.12.3
bottleneck           : 1.3.8
dataframe-api-compat : None
fastparquet          : None
fsspec               : 2024.3.0
gcsfs                : None
matplotlib           : 3.8.3
numba                : 0.59.0
numexpr              : 2.9.0
odfpy                : None
openpyxl             : 3.1.2
pandas_gbq           : None
pyarrow              : 15.0.0
pyreadstat           : None
pyxlsb               : 1.0.10
s3fs                 : None
scipy                : 1.12.0
sqlalchemy           : 2.0.28
tables               : None
tabulate             : 0.9.0
xarray               : None
xlrd                 : 2.0.1
zstandard            : 0.22.0
tzdata               : 2024.1
qtpy                 : 2.4.1
pyqt5                : None

MCRE-BE avatar Mar 20 '24 14:03 MCRE-BE

Hm, this works on main for me.

Can you try a newer version of pandas (e.g. 2.2.1)?

lithomas1 avatar Mar 20 '24 22:03 lithomas1
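To compare versions quickly, a small check like the one below may help. It is only a sketch: the helper name `check_index_date_parsing` is hypothetical, and it assumes that an in-memory CSV exercises the same code path as the `test.csv` file in the report. It prints the installed pandas version and whether the failing call from the report raises.

```python
import io

import pandas as pd

# One data row with a timestamp column, mirroring the shape of test.csv in the report.
CSV_TEXT = "a,b\n1,2024-03-20 14:03:00\n"


def check_index_date_parsing(dtype_backend="pyarrow"):
    """Hypothetical helper: run the call from the report and describe what happened."""
    try:
        result = pd.read_csv(
            io.StringIO(CSV_TEXT),
            parse_dates=["b"],
            index_col="b",
            dtype_backend=dtype_backend,
        )
    except ImportError:
        # Raised when the optional pyarrow dependency is not installed.
        return "skipped: pyarrow is not available"
    except ValueError as exc:
        # The report shows a ValueError raised from concat_date_cols.
        return f"fails: {exc}"
    return f"ok: index dtype is {result.index.dtype}"


print(pd.__version__, "->", check_index_date_parsing())
```

Running this in the exact environment (or notebook kernel) under suspicion also confirms which pandas version is actually being imported.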

@lithomas1 I could reproduce this error on pandas version 2.3.0

pmhatre1 avatar Mar 20 '24 23:03 pmhatre1

@lithomas1 Do you need any help triaging other defects/bugs?

pmhatre1 avatar Mar 20 '24 23:03 pmhatre1

My bad, forgot to run the second command.

Nice catch.

Looks like parsing.concat_date_cols (which concatenates date_cols) is getting passed an ArrowExtensionArray.

That behavior is deprecated (to be removed in 3.0), so it should be easy to make this work for 3.0.

lithomas1 avatar Mar 21 '24 00:03 lithomas1
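Since the traceback shows the failure inside `_agg_index`, i.e. while the index is being built, one possible workaround on affected releases is to parse the date column as a regular column and promote it to the index afterwards. The sketch below assumes that parsing a non-index date column is unaffected; this has not been checked against every 2.x release. The helper name `read_csv_with_date_index` is hypothetical.

```python
import io

import pandas as pd


def read_csv_with_date_index(source, date_col, dtype_backend):
    """Hypothetical helper: parse the date column first, then make it the index."""
    # No index_col here, so the date conversion happens on a regular column.
    frame = pd.read_csv(source, parse_dates=[date_col], dtype_backend=dtype_backend)
    # Promote the already-parsed column to the index in a separate step.
    return frame.set_index(date_col)


# The demo uses "numpy_nullable" so it runs without pyarrow installed;
# pass dtype_backend="pyarrow" to match the setup in the report.
frame = read_csv_with_date_index(
    io.StringIO("a,b\n1,2024-03-20 14:03:00\n"),
    date_col="b",
    dtype_backend="numpy_nullable",
)
print(frame.index)
```

The trade-off is an extra `set_index` step, but it keeps the index construction out of the parser.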

> Can you try a newer version of pandas (e.g. 2.2.1)?

My bad. I should have checked and not have assumed the kernel I was using used the latest version. Thanks for the work.

MCRE-BE avatar Mar 21 '24 09:03 MCRE-BE

FWIW I was getting the same issue on pandas 2.2 when using "numpy_nullable" as the backend, but this appears to be working on pandas version 3.0.0.dev0+811.gcc427ec305?

WillAyd avatar Jul 24 '24 17:07 WillAyd

Still present in 2.2.2

INSTALLED VERSIONS
------------------
commit                 : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python                 : 3.12.5.final.0
python-bits            : 64
OS                     : Windows
OS-release             : 10
Version                : 10.0.19042
machine                : AMD64
processor              : AMD64 Family 23 Model 24 Stepping 1, AuthenticAMD
byteorder              : little
LC_ALL                 : None
LANG                   : None
LOCALE                 : Dutch_Belgium.1252

pandas                 : 2.2.2
numpy                  : 2.0.1
pytz                   : 2024.1
dateutil               : 2.9.0
setuptools             : 72.2.0
pip                    : 24.2
Cython                 : None
pytest                 : 8.3.2
hypothesis             : None
sphinx                 : 7.4.7
blosc                  : None
feather                : None
xlsxwriter             : 3.2.0
lxml.etree             : 5.3.0
html5lib               : 1.1
pymysql                : None
psycopg2               : None
jinja2                 : 3.1.4
IPython                : 8.21.0
pandas_datareader      : None
adbc-driver-postgresql : None
adbc-driver-sqlite     : None
bs4                    : 4.12.3
bottleneck             : 1.4.0
dataframe-api-compat   : None
fastparquet            : None
fsspec                 : 2024.6.1
gcsfs                  : None
matplotlib             : 3.9.1
numba                  : 0.60.0
numexpr                : 2.10.0
odfpy                  : None
openpyxl               : 3.1.5
pandas_gbq             : None
pyarrow                : 17.0.0
pyreadstat             : None
python-calamine        : None
pyxlsb                 : 1.0.10
s3fs                   : None
scipy                  : 1.14.1
sqlalchemy             : None
tables                 : None
tabulate               : 0.9.0
xarray                 : None
xlrd                   : 2.0.1
zstandard              : 0.23.0
tzdata                 : 2024.1
qtpy                   : 2.4.1
pyqt5                  : None

MCRE-BE avatar Aug 28 '24 05:08 MCRE-BE

Since this works on main I think it would be good to first and foremost add a test to the repository, so that we can prevent regression in the future. Would anyone be willing to contribute that?

As for fixing this on the 2.x series, we unfortunately don't have many releases left. It is possible we can fix this on a 2.3 bugfix branch, although that may be rather challenging given that the problem doesn't occur on our main development branch.

WillAyd avatar Oct 15 '24 16:10 WillAyd

@WillAyd : I can try to write a test for that in the next couple of days. I just have to dive into the requirements. Thanks for confirming that it works on main.

MCRE-BE avatar Oct 16 '24 16:10 MCRE-BE

@WillAyd : first try is here. Let me know if it's in the right spot before I make a pull request. https://github.com/pandas-dev/pandas/commit/6e83c4b8114539701c17b7ced0973ac32356eaf8

MCRE-BE avatar Oct 17 '24 05:10 MCRE-BE

Thanks! I think that's a great start. There's a few things to tweak but if you push up as a PR first we can give the feedback directly there

WillAyd avatar Oct 17 '24 12:10 WillAyd

I'll do that!

MCRE-BE avatar Oct 17 '24 18:10 MCRE-BE