
BUG: .max() raises exception on Series with object dtype and mixture of Timestamp and NaT: TypeError: '>=' not supported between instances of 'Timestamp' and 'float'

Open · kerrickstaley opened this issue 1 year ago · 1 comment

Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
df1 = pd.DataFrame({
    'a': [pd.Timestamp('2024-05-13 12:00:00', tz='America/New_York')],
})
df2 = pd.DataFrame({
    'a': [pd.NaT],
})
df_concat = pd.concat([df1, df2])
df_concat['a'].max()

Issue Description

The above code raises the exception:

TypeError: '>=' not supported between instances of 'Timestamp' and 'float'

This code is a simplified version of some prod code that caused issues at my work.

I believe what's happening is that df1's column has dtype datetime64[ns, America/New_York] while df2's has dtype datetime64[ns], and when you concat them, the resulting dtype is object. Then, .max() coerces pd.NaT to NaN, and you end up comparing a Timestamp with a float.
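You can confirm this by inspecting the dtypes directly (same frames as the reproducible example above):

```python
import pandas as pd

df1 = pd.DataFrame({
    'a': [pd.Timestamp('2024-05-13 12:00:00', tz='America/New_York')],
})
df2 = pd.DataFrame({
    'a': [pd.NaT],
})

print(df1['a'].dtype)  # datetime64[ns, America/New_York]
print(df2['a'].dtype)  # datetime64[ns]

# The dtypes disagree (tz-aware vs. tz-naive), so concat falls back to object:
df_concat = pd.concat([df1, df2])
print(df_concat['a'].dtype)  # object
```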

Expected Behavior

I expect the code to return pd.Timestamp('2024-05-13 12:00:00', tz='America/New_York').

I think Pandas should be robust and handle this case, even though the dtypes aren't perfectly "correct". I think the right place to fix this is in the .max() function: the code

max([pd.Timestamp('2024-05-13 12:00:00', tz='America/New_York'), pd.NaT])

(where max is builtins.max) works fine, so you would also expect the Pandas equivalent to work.

You could maybe also make an argument that pd.concat should special-case this and return a column with dtype datetime64[ns, America/New_York], but I'm less sure about that.
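In the meantime, one way to avoid the object fallback entirely is to make the dtypes agree before concatenating. A sketch, assuming the naive column is actually meant to be in America/New_York:

```python
import pandas as pd

df1 = pd.DataFrame({
    'a': [pd.Timestamp('2024-05-13 12:00:00', tz='America/New_York')],
})
df2 = pd.DataFrame({
    'a': [pd.NaT],
})

# Localize the naive datetime64[ns] column so both dtypes match;
# NaT passes through tz_localize unchanged.
df2['a'] = df2['a'].dt.tz_localize('America/New_York')

df_concat = pd.concat([df1, df2])
print(df_concat['a'].dtype)  # datetime64[ns, America/New_York]
print(df_concat['a'].max())  # 2024-05-13 12:00:00-04:00
```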

Longer-term, I feel like Pandas should support datetime columns with heterogeneous timezones; the requirement that timezones be the same for a whole column feels like an artificial constraint, and many real-world datasets will naturally have heterogeneous timezones.

Installed Versions

INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.12.3.final.0
python-bits : 64
OS : Darwin
OS-release : 23.4.0
Version : Darwin Kernel Version 23.4.0: Fri Mar 15 00:19:22 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T8112
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.1.1
pip : 24.0
Cython : None
pytest : 8.1.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.1.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.22.2
pandas_datareader : None
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

kerrickstaley avatar May 13 '24 16:05 kerrickstaley

Thanks for the report. As a workaround, df_concat['a'].convert_dtypes().max() works

asishm avatar May 13 '24 21:05 asishm

take

li00-lab avatar Jul 24 '24 03:07 li00-lab

It seems that the original code changes the column's dtype to object after concatenating the two DataFrames, which is why the comparison in df_concat['a'].max() fails.

import pandas as pd
df1 = pd.DataFrame({
    'a': [pd.Timestamp('2024-05-13 12:00:00', tz='America/New_York')],
})
df2 = pd.DataFrame({
    'a': [pd.NaT],
})
df_concat = pd.concat([df1, df2])

print(df_concat.iloc[0])
print(df_concat.iloc[1])

#output
a    2024-05-13 12:00:00-04:00
Name: 0, dtype: object
a    NaT
Name: 0, dtype: object

However, if we initialize the DataFrame in a single step as below, the comparison works because the dtype is datetime64[ns, America/New_York]:

df_concat = pd.DataFrame({
    "a": [pd.Timestamp('2024-05-13 12:00:00', tz='America/New_York'), pd.NaT]
})
df_concat['a'].max()  # Timestamp('2024-05-13 12:00:00-0400', tz='America/New_York')

print(df_concat.iloc[0])
print(df_concat.iloc[1])

# Output
a   2024-05-13 12:00:00-04:00
Name: 0, dtype: datetime64[ns, America/New_York]
a   NaT
Name: 1, dtype: datetime64[ns, America/New_York]

The workaround you provided works the same way: it converts the dtype to datetime, after which .max() succeeds.

df_concat['a'].convert_dtypes().dtypes # datetime64[ns, America/New_York]
df_concat['a'].convert_dtypes().max() # Timestamp('2024-05-13 12:00:00-0400', tz='America/New_York')

Therefore, it seems the issue here is due to concat() rather than max(). I hope this makes sense; let me research further into this.

li00-lab avatar Jul 26 '24 06:07 li00-lab

This bug also arises when selecting a row and a subset of datetime64[ns] columns with .loc. Each of the columns is datetime64[ns], but .loc somehow returns a Series of dtype object.

The proposed workaround of convert_dtypes also works here.

ischmidt20 avatar Oct 30 '24 16:10 ischmidt20