BUG: Concat for two dataframes row-wise fails if one of columns is datetime and iterrows was used
Pandas version checks

- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example

```python
import pandas as pd
import datetime

data = {'Column1': [1, 2, 3, 33],
        'Column2': [4, 5, 6, 66],
        'Column3': [7, 8, 9, 99]}
df = pd.DataFrame(data)

# once the line below is uncommented, the pd.concat would fail...
# df["Column2"] = datetime.datetime(2018, 1, 1)

# ... if this one is uncommented instead, it would work though
# df["Column2"] = datetime.date(2018, 1, 1)

random_rows = pd.DataFrame([row for _, row in df.sample(n=3).iterrows()])
df = pd.concat([df, random_rows])
print(df)
```
Issue Description

The `pd.concat` of two dataframes fails when two conditions are met:

- one of the columns is `datetime` (if it is `date`, everything works though)
- the second dataframe in `concat` was constructed using `.iterrows()`

Changing `iterrows()` to `itertuples()` (in case something broke down in types) doesn't solve the problem.
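One way to narrow this down is to compare the dtype of the datetime column before and after the `iterrows()` round-trip. A diagnostic sketch (not part of the original report; on pandas 2.1.x the two dtypes reportedly differ in resolution):

```python
import datetime
import pandas as pd

df = pd.DataFrame({'Column1': [1, 2, 3, 33],
                   'Column2': [4, 5, 6, 66],
                   'Column3': [7, 8, 9, 99]})
df["Column2"] = datetime.datetime(2018, 1, 1)

# Rebuilding rows through iterrows() forces each row into an object-dtype
# Series, so the datetime column's dtype is re-inferred on reconstruction.
random_rows = pd.DataFrame([row for _, row in df.sample(n=3).iterrows()])

print(df["Column2"].dtype, random_rows["Column2"].dtype)
```

If the two printed dtypes disagree (e.g. `datetime64[us]` vs `datetime64[ns]`), that mismatch is what `concat` trips over.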
Expected Behavior

It works with the `date` object, so I would expect this to be a bug.
Installed Versions

```text
INSTALLED VERSIONS
commit : a671b5a8bf5dd13fb19f0e88edc679bc9e15c673 python : 3.11.7.final.0 python-bits : 64 OS : Darwin OS-release : 23.1.0 Version : Darwin Kernel Version 23.1.0: Mon Oct 9 21:28:12 PDT 2023; root:xnu-10002.41.9~6/RELEASE_ARM64_T8103 machine : arm64 processor : arm byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8
pandas : 2.1.4 numpy : 1.26.3 pytz : 2023.3.post1 dateutil : 2.8.2 setuptools : 68.2.2 pip : 23.3.1 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : None IPython : None pandas_datareader : None bs4 : None bottleneck : 1.3.5 dataframe-api-compat: None fastparquet : None fsspec : None gcsfs : None matplotlib : None numba : None numexpr : 2.8.7 odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pyreadstat : None pyxlsb : None s3fs : None scipy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None zstandard : None tzdata : 2023.3 qtpy : None pyqt5 : None
```
Can you post the traceback?
I am not able to replicate on main.
> Can you post the traceback?
```text
cd /Users/ilko/Code/clay/backend ; /usr/bin/env /Users/ilko/Code/clay/backend/.conda/bin/python /Users/ilko/.vscode/extensions/ms-python.python-2023.22.1/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher 63255 -- main.py --init
Traceback (most recent call last):
  File "/Users/ilko/Code/clay/backend/main.py", line 17, in <module>
    df = pd.concat([df, random_rows])
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ilko/Code/clay/backend/.conda/lib/python3.11/site-packages/pandas/core/reshape/concat.py", line 393, in concat
    return op.get_result()
           ^^^^^^^^^^^^^^^
  File "/Users/ilko/Code/clay/backend/.conda/lib/python3.11/site-packages/pandas/core/reshape/concat.py", line 680, in get_result
    new_data = concatenate_managers(
               ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ilko/Code/clay/backend/.conda/lib/python3.11/site-packages/pandas/core/internals/concat.py", line 189, in concatenate_managers
    values = _concatenate_join_units(join_units, copy=copy)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ilko/Code/clay/backend/.conda/lib/python3.11/site-packages/pandas/core/internals/concat.py", line 486, in _concatenate_join_units
    concat_values = concat_compat(to_concat, axis=1)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ilko/Code/clay/backend/.conda/lib/python3.11/site-packages/pandas/core/dtypes/concat.py", line 132, in concat_compat
    return cls._concat_same_type(to_concat_eas)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ilko/Code/clay/backend/.conda/lib/python3.11/site-packages/pandas/core/arrays/datetimelike.py", line 2251, in _concat_same_type
    new_obj = super()._concat_same_type(to_concat, axis)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ilko/Code/clay/backend/.conda/lib/python3.11/site-packages/pandas/core/arrays/_mixins.py", line 230, in _concat_same_type
    return super()._concat_same_type(to_concat, axis=axis)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "arrays.pyx", line 190, in pandas._libs.arrays.NDArrayBacked._concat_same_type
ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 4 and the array at index 1 has size 3
```
I ran into a similar issue today when concatenating two dataframes that had a mix of `datetime64[ns]` and `datetime64[us]` dtypes in a datetime column.
That appears to be the issue here @cameronbronstein - while concat on 2.2.x and main now succeeds, the DataFrame originally has `Column2` as `datetime64[us]`, and after the `sample`/`iterrows` round-trip it becomes `datetime64[ns]`. I think this is unexpected. Further investigations and PRs to fix are welcome!
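Until the inference is fixed, one workaround is to cast both frames to a common resolution before concatenating. A sketch (assuming pandas 2.x, where `Series.astype` accepts unit-qualified datetime dtypes; the frames below are constructed to mimic the `us`-vs-`ns` mismatch described above):

```python
import numpy as np
import pandas as pd

# Two frames whose datetime columns differ only in resolution.
a = pd.DataFrame({"t": np.array(["2018-01-01"], dtype="datetime64[us]")})
b = pd.DataFrame({"t": np.array(["2018-01-02"], dtype="datetime64[ns]")})

# Normalize to one unit so concat sees identical dtypes.
b["t"] = b["t"].astype("datetime64[us]")
out = pd.concat([a, b], ignore_index=True)
print(out["t"].dtype)  # datetime64[us]
```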
take
Looks like the unit of the datetime type gets set to a default of `ns` during construction of the DataFrame from the sampled rows. During construction, columns containing datetime-type objects go through `maybe_infer_to_datetimelike` in `core/dtypes/cast.py`, which calls `lib.maybe_convert_objects`, which constructs a DatetimeIndex from the ndarray of `Column2` values. When that array reaches `_sequence_to_dt64` in `core/arrays/datetimes.py` during the DatetimeIndex construction, `data_dtype = np.dtype('O')` and `out_unit = None`, so `out_unit` gets set to `ns`.
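This default can be seen from user code as well: constructing a DatetimeIndex directly from an object ndarray of non-nanosecond Timestamps exercises the same inference path. A sketch (on pandas 2.1.x this reportedly yields `datetime64[ns]` despite the `us` inputs; the result may differ on versions where the inference has been changed):

```python
import numpy as np
import pandas as pd

# An object ndarray of microsecond-resolution Timestamps, like the
# Column2 values that reach _sequence_to_dt64 during reconstruction.
vals = np.array([pd.Timestamp("2018-01-01").as_unit("us")], dtype=object)

idx = pd.DatetimeIndex(vals)
print(idx.dtype)
```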
I've played around with it a bit: when every column is a datetime type there is no issue, but when only one or a few are datetime types, they get converted to `datetime64[ns]`. It also occurs with both `iterrows` and `itertuples`. I was able to fix the conversion in the issue example by fetching `unit` from the datetime-type object in `_sequence_to_dt64`, but this runs into issues in instances where the conversion to `datetime64[ns]` is the expected behavior. I'm not sure what the right fix here would be.
Here is the check I implemented:

```python
if data_dtype is np.dtype('O') and isinstance(data, np.ndarray):
    if len(data) > 0 and isinstance(data[0], Timestamp) and data[0].tz is None:
        # catch ndarrays that should have a datetime dtype but don't
        out_unit = data[0].unit
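For reference, the `unit`-fetching idea above relies on `Timestamp.unit` and `Timestamp.as_unit`, which pandas 2.x exposes publicly. A small sketch (the unit inferred for the initial Timestamp depends on the input and pandas version, so it is printed rather than assumed):

```python
import pandas as pd

ts = pd.Timestamp("2018-01-01 00:00:00.123456")
print(ts.unit)  # resolution inferred from the input (version-dependent)

# as_unit converts the stored resolution without changing the instant.
coarser = ts.as_unit("us")
print(coarser.unit)  # us
```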