
BUG: Concat for two dataframes row-wise fails if one of columns is datetime and iterrows was used

Open ilyakochik opened this issue 1 year ago • 7 comments

Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import datetime

data = {'Column1': [1, 2, 3, 33],
        'Column2': [4, 5, 6, 66],
        'Column3': [7, 8, 9, 99]}
df = pd.DataFrame(data)


# once the below is uncommented, the pd.concat would fail...
# df["Column2"] = datetime.datetime(2018, 1, 1)

# ... if this is uncommented, it would work though
# df["Column2"] = datetime.date(2018, 1, 1)

random_rows = pd.DataFrame([row for _, row in df.sample(n=3).iterrows()])
df = pd.concat([df, random_rows])
print(df)

Issue Description

The pd.concat of two dataframes fails when two conditions are met:

  • one of the columns is datetime (if it is date everything works though)
  • the second dataframe in concat was constructed using .iterrows()

Changing iterrows() to itertuples() (in case the row conversion was mangling the types) does not solve the problem.
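The mismatch can be seen without concat at all: the column's resolution changes when the sampled rows are rebuilt into a DataFrame. A minimal sketch (the exact units printed depend on the pandas version; datetime64[us] turning into datetime64[ns] is what was reported on 2.1.x):

```python
import datetime
import pandas as pd

df = pd.DataFrame({"Column1": [1, 2, 3, 33]})
# A python datetime scalar; on pandas 2.x this is typically stored
# with microsecond resolution (datetime64[us]).
df["Column2"] = datetime.datetime(2018, 1, 1)
print(df["Column2"].dtype)

# Rebuilding a frame from individual rows re-infers the dtype; on the
# affected versions the unit falls back to nanoseconds (datetime64[ns]).
rows = pd.DataFrame([row for _, row in df.sample(n=3).iterrows()])
print(rows["Column2"].dtype)
```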

Expected Behavior

The same code works when the column holds a date object, so I would expect it to work with a datetime as well; this looks like a bug.

Installed Versions

INSTALLED VERSIONS

commit : a671b5a8bf5dd13fb19f0e88edc679bc9e15c673
python : 3.11.7.final.0
python-bits : 64
OS : Darwin
OS-release : 23.1.0
Version : Darwin Kernel Version 23.1.0: Mon Oct 9 21:28:12 PDT 2023; root:xnu-10002.41.9~6/RELEASE_ARM64_T8103
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.4
numpy : 1.26.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
bs4 : None
bottleneck : 1.3.5
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : 2.8.7
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

ilyakochik avatar Jan 08 '24 17:01 ilyakochik

Can you post the traceback?

phofl avatar Jan 08 '24 21:01 phofl

I am not able to replicate on main.

rhshadrach avatar Jan 08 '24 22:01 rhshadrach

Can you post the traceback?

cd /Users/ilko/Code/clay/backend ; /usr/bin/env /Users/ilko/Code/clay/backend/.conda/bin/python /Users/ilko/.vscode/extensions/ms-python.python-2023.22.1/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher 63255 -- main.py --init 
Traceback (most recent call last):
  File "/Users/ilko/Code/clay/backend/main.py", line 17, in <module>
    df = pd.concat([df, random_rows])
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ilko/Code/clay/backend/.conda/lib/python3.11/site-packages/pandas/core/reshape/concat.py", line 393, in concat
    return op.get_result()
           ^^^^^^^^^^^^^^^
  File "/Users/ilko/Code/clay/backend/.conda/lib/python3.11/site-packages/pandas/core/reshape/concat.py", line 680, in get_result
    new_data = concatenate_managers(
               ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ilko/Code/clay/backend/.conda/lib/python3.11/site-packages/pandas/core/internals/concat.py", line 189, in concatenate_managers
    values = _concatenate_join_units(join_units, copy=copy)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ilko/Code/clay/backend/.conda/lib/python3.11/site-packages/pandas/core/internals/concat.py", line 486, in _concatenate_join_units
    concat_values = concat_compat(to_concat, axis=1)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ilko/Code/clay/backend/.conda/lib/python3.11/site-packages/pandas/core/dtypes/concat.py", line 132, in concat_compat
    return cls._concat_same_type(to_concat_eas)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ilko/Code/clay/backend/.conda/lib/python3.11/site-packages/pandas/core/arrays/datetimelike.py", line 2251, in _concat_same_type
    new_obj = super()._concat_same_type(to_concat, axis)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ilko/Code/clay/backend/.conda/lib/python3.11/site-packages/pandas/core/arrays/_mixins.py", line 230, in _concat_same_type
    return super()._concat_same_type(to_concat, axis=axis)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "arrays.pyx", line 190, in pandas._libs.arrays.NDArrayBacked._concat_same_type
ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 4 and the array at index 1 has size 3

ilyakochik avatar Jan 10 '24 14:01 ilyakochik

I ran into a similar issue today when concatenating two dataframes whose datetime column was a mix of datetime64[ns] and datetime64[us] dtypes.
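For reference, the mixed-resolution situation can be reproduced directly (a sketch; on the affected 2.1.x versions the concat raised the ValueError shown in the traceback above, while on 2.2+ it succeeds):

```python
import pandas as pd

# Two frames whose datetime column differs only in resolution.
a = pd.DataFrame({"ts": pd.to_datetime(["2024-01-01"]).as_unit("ns")})
b = pd.DataFrame({"ts": pd.to_datetime(["2024-01-02"]).as_unit("us")})
print(a["ts"].dtype, b["ts"].dtype)  # datetime64[ns] datetime64[us]

try:
    out = pd.concat([a, b], ignore_index=True)
    print(out["ts"].dtype)
except ValueError as exc:
    # seen on the affected 2.1.x versions
    print("concat failed:", exc)
```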

cameronbronstein avatar Feb 16 '24 01:02 cameronbronstein

That appears to be the issue here @cameronbronstein - while concat on 2.2.x and main now succeeds, the original DataFrame has Column2 as datetime64[us], and rebuilding a frame from the sampled rows turns it into datetime64[ns]. I think this is unexpected. Further investigation and PRs to fix are welcome!

rhshadrach avatar Mar 02 '24 19:03 rhshadrach

take

emiford avatar Apr 04 '24 01:04 emiford

Looks like the unit of the datetime type gets set to a default of ns during the construction of the DataFrame from the sampled rows. During construction, columns containing datetime objects go through core/dtypes/cast.py::maybe_infer_datetimelike, which calls lib.maybe_convert_objects, which constructs a DatetimeIndex from the ndarray of Column2 values. When the array reaches core/arrays/datetimes.py::_sequence_to_dt64 during the DatetimeIndex construction, data_dtype is np.dtype('O'), out_unit is None, and out_unit therefore falls back to ns.
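The ns fallback described above can be observed by round-tripping a non-ns Timestamp through an object-dtype array (a sketch; on the versions discussed here the inferred index comes back as datetime64[ns] regardless of the scalar's unit):

```python
import numpy as np
import pandas as pd

ts = pd.Timestamp(2018, 1, 1).as_unit("us")
print(ts.unit)  # us

# Object-dtype input takes the _sequence_to_dt64 path, where out_unit
# starts as None and, on the affected versions, defaults to "ns".
arr = np.array([ts, ts], dtype=object)
idx = pd.DatetimeIndex(arr)
print(idx.dtype)
```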

I've played around with it a bit: when every column is a datetime type there is no issue, but when only one or a few of them are, those get converted to datetime64[ns]. It also occurs with both iterrows and itertuples. I was able to fix the conversion in the issue example by fetching the unit from the datetime object inside _sequence_to_dt64, but this breaks cases where conversion to datetime64[ns] is the expected behavior. I'm not sure what the right fix would be.

Here is the check I implemented:

if data_dtype is np.dtype("O") and isinstance(data, np.ndarray):
    if len(data) > 0 and isinstance(data[0], Timestamp) and data[0].tz is None:
        # catch ndarrays that should have a datetime dtype but don't
        out_unit = data[0].unit

emiford avatar Apr 26 '24 20:04 emiford