BUG: .max() raises exception on Series with object dtype and mixture of Timestamp and NaT: TypeError: '>=' not supported between instances of 'Timestamp' and 'float'
Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd

df1 = pd.DataFrame({
    'a': [pd.Timestamp('2024-05-13 12:00:00', tz='America/New_York')],
})
df2 = pd.DataFrame({
    'a': [pd.NaT],
})
df_concat = pd.concat([df1, df2])
df_concat['a'].max()  # raises TypeError
Issue Description
The above code raises the exception:
TypeError: '>=' not supported between instances of 'Timestamp' and 'float'
This code is a simplified version of some prod code that caused issues at my work.
I believe what's happening is that df1's column has dtype datetime64[ns, America/New_York] and df2's has dtype datetime64[ns], and when you concat them, the resulting dtype is object. Then .max() coerces pd.NaT to NaN, and you get a comparison between a Timestamp and a float.
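Inspecting the dtypes at each step confirms this (a quick check using the frames from the reproducible example above):

print(df1['a'].dtype)        # datetime64[ns, America/New_York]
print(df2['a'].dtype)        # datetime64[ns]
print(df_concat['a'].dtype)  # object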
Expected Behavior
I expect the code to return pd.Timestamp('2024-05-13 12:00:00', tz='America/New_York').
I think pandas should be robust enough to handle this case, even though the dtypes aren't perfectly "correct". The right place to fix it seems to be the .max() function: the code
max([pd.Timestamp('2024-05-13 12:00:00', tz='America/New_York'), pd.NaT])
(where max is builtins.max) works fine, so you would expect the pandas equivalent to work as well.
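For concreteness, a minimal check of the builtin behavior (comparisons against pd.NaT return False, so the builtin keeps the Timestamp here):

import pandas as pd

ts = pd.Timestamp('2024-05-13 12:00:00', tz='America/New_York')
max([ts, pd.NaT])  # Timestamp('2024-05-13 12:00:00-0400', tz='America/New_York')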
You could maybe also make an argument that pd.concat should special-case this and return a column with dtype datetime64[ns, America/New_York], but I'm less sure about that.
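For reference, aligning the dtypes up front (the step such special-casing would automate) already works today: localizing the all-NaT column to the same timezone keeps the tz-aware dtype through concat, and .max() then behaves:

import pandas as pd

df1 = pd.DataFrame({
    'a': [pd.Timestamp('2024-05-13 12:00:00', tz='America/New_York')],
})
df2 = pd.DataFrame({
    'a': [pd.NaT],
})
df2['a'] = df2['a'].dt.tz_localize('America/New_York')  # align dtypes before concat
df_concat = pd.concat([df1, df2])
print(df_concat['a'].dtype)  # datetime64[ns, America/New_York]
print(df_concat['a'].max())  # 2024-05-13 12:00:00-04:00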
Longer-term, I feel like pandas should support datetime columns with heterogeneous timezones; the requirement that timezones be the same for a whole column feels like an artificial constraint, and many real-world datasets naturally have heterogeneous timezones.
Installed Versions
pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.1.1
pip : 24.0
Cython : None
pytest : 8.1.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.1.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.22.2
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None
Thanks for the report. As a workaround, df_concat['a'].convert_dtypes().max() works
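Another option that should work here, assuming every non-missing value shares the same timezone, is converting the column explicitly:

pd.to_datetime(df_concat['a']).max()
# Timestamp('2024-05-13 12:00:00-0400', tz='America/New_York')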
take
It seems like the original code changes the column's data type to object after concatenating the two DataFrames, which is why the comparison in df_concat['a'].max() fails.
import pandas as pd

df1 = pd.DataFrame({
    'a': [pd.Timestamp('2024-05-13 12:00:00', tz='America/New_York')],
})
df2 = pd.DataFrame({
    'a': [pd.NaT],
})
df_concat = pd.concat([df1, df2])
print(df_concat.iloc[0])
print(df_concat.iloc[1])

# Output
a    2024-05-13 12:00:00-04:00
Name: 0, dtype: object
a    NaT
Name: 0, dtype: object
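Checking the column dtype directly shows the same:

print(df_concat['a'].dtype)  # object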
However, if we initialize the DataFrame with both values directly, the comparison works because the column's dtype is datetime64[ns, America/New_York]:
df_concat = pd.DataFrame({
    "a": [pd.Timestamp('2024-05-13 12:00:00', tz='America/New_York'), pd.NaT]
})
df_concat['a'].max()  # Timestamp('2024-05-13 12:00:00-0400', tz='America/New_York')
print(df_concat.iloc[0])
print(df_concat.iloc[1])

# Output
a    2024-05-13 12:00:00-04:00
Name: 0, dtype: datetime64[ns, America/New_York]
a    NaT
Name: 1, dtype: datetime64[ns, America/New_York]
As for the workaround you provided, it converts the dtype back to datetime, so .max() works as well:
df_concat['a'].convert_dtypes().dtypes # datetime64[ns, America/New_York]
df_concat['a'].convert_dtypes().max() # Timestamp('2024-05-13 12:00:00-0400', tz='America/New_York')
Therefore, it seems like the issue here is due to concat() rather than max(). I hope this makes sense; let me research this further.
This bug also arises when selecting a row and a subset of datetime64[ns] columns with .loc. Each of the columns is datetime64[ns], but .loc somehow returns a Series of dtype object.
The proposed convert_dtypes workaround also works here.
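A rough sketch of the shape of that case (the frame here is hypothetical, and whether the object dtype appears may depend on the pandas version):

import pandas as pd

df = pd.DataFrame({
    'start': pd.to_datetime(['2024-05-13 12:00:00']),
    'end': pd.to_datetime([pd.NaT]),
})  # both columns are datetime64[ns]

row = df.loc[0, ['start', 'end']]  # reported to come back as dtype object
row.convert_dtypes().max()         # the workaround applies here as well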