pandas
BUG: fillna('') does not replace NaT
pandas generally tries to coerce values to fit the column dtype, or upcasts the dtype to fit.
For a setting operation this is convenient and, I think, what a user expects:
In [35]: df = DataFrame({'A' : Series(dtype='M8[ns]'), 'B' : Series([np.nan],dtype='object'), 'C' : ['foo'], 'D' : [1]})
In [36]: df
Out[36]:
A B C D
0 NaT NaN foo 1
In [37]: df.dtypes
Out[37]:
A datetime64[ns]
B object
C object
D int64
dtype: object
In [38]: df.loc[0,'D'] = 1.0
In [39]: df.dtypes
Out[39]:
A datetime64[ns]
B object
C object
D float64
dtype: object
However, for a .fillna (or .replace) operation this might be a bit unexpected. Here A is coerced to object dtype, even though it was datetime64[ns]:
In [40]: df.fillna('')
Out[40]:
A B C D
0 foo 1
In [41]: df.fillna('').dtypes
Out[41]:
A object
B object
C object
D float64
dtype: object
So a possibility is to add a keyword errors='raise'|'coerce'|'ignore'. The behavior shown above (upcasting to object) would be the equivalent of errors='ignore', while skipping the incompatible column would be done with errors='coerce' (and of course 'raise' would raise).
Ideally the default would be 'coerce', I think (to skip columns for non-compatible values). Any thoughts on this?
cc @ywang007
xref. #15533
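A minimal sketch of how the three proposed modes could behave, following the semantics described above ('ignore' upcasts to object, 'coerce' skips incompatible columns, 'raise' raises). `fillna_with_errors` and `_is_compatible` are hypothetical names used only for illustration; `DataFrame.fillna` has no `errors` keyword, and the compatibility check here is a deliberately simplified assumption, not how pandas validates fill values.

```python
import numbers
from datetime import datetime

import numpy as np
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def _is_compatible(dtype, value):
    """Hypothetical check: can `value` be stored without changing `dtype`?"""
    if is_datetime64_any_dtype(dtype):
        return isinstance(value, (pd.Timestamp, datetime, np.datetime64))
    if is_numeric_dtype(dtype):
        return isinstance(value, numbers.Number) and not isinstance(value, bool)
    return True  # object and other dtypes: assume anything fits


def fillna_with_errors(df, value, errors="coerce"):
    """Hypothetical wrapper illustrating the proposed `errors` modes."""
    if errors not in {"raise", "coerce", "ignore"}:
        raise ValueError(f"invalid errors value: {errors!r}")
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if _is_compatible(col.dtype, value):
            out[name] = col.fillna(value)
        elif errors == "raise":
            raise TypeError(f"{value!r} is incompatible with column {name!r}")
        elif errors == "ignore":
            # Upcast to object first, then fill.
            out[name] = col.astype(object).fillna(value)
        # errors == "coerce": skip the column, leaving its dtype untouched
    return out
```

With a frame such as `pd.DataFrame({"A": pd.Series([pd.NaT], dtype="datetime64[ns]"), "D": [1]})`, `errors="coerce"` leaves A as a datetime column with its NaT, `errors="ignore"` returns A as an object column holding the fill value, and `errors="raise"` raises a TypeError.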
@jreback I think this keyword would be a :+1:. It would be a way of reconciling the arguments for and against strict versus lenient validation that are under discussion at PR#15587. Once that PR is merged, this behavior could presumably be added as a single if errors == 'raise': validate_fill_value(obj, value) call.
I think it's worth considering adding similar behavior to methods implementing fill_value. I'm not sure I like that idea, it feels like a lot of API overhead, but, worth considering.
This behavior no longer coerces to object. I suppose it could use a test, orthogonal to the enhancement request.
In [34]: df = DataFrame({'A' : Series(dtype='M8[ns]'), 'B' : Series([np.nan],dtype='object'), 'C' : ['foo'], 'D' : [1]})
In [35]: df.loc[0,'D'] = 1.0
In [36]: df.dtypes
Out[36]:
A datetime64[ns]
B object
C object
D int64
dtype: object
In [37]: df.fillna('')
Out[37]:
A B C D
0 NaT foo 1
In [38]: df.fillna('').dtypes
Out[38]:
A datetime64[ns]
B object
C object
D int64
dtype: object
In [39]: pd.__version__
Out[39]: '1.3.0.dev0+1383.g855696cde0'
Actually, I think this is a bug and the original behavior was correct. NaT is an "na value", yet it was not replaced by the empty string:
In [1]: df = DataFrame({'A': Series(dtype='M8[ns]'), 'B': Series([np.nan], dtype='object'), 'C': ['foo'], 'D': [1]})
In [2]: df.fillna('')
Out[2]:
A B C D
0 NaT foo 1
In [3]: df.fillna('').dtypes
Out[3]:
A datetime64[ns]
B object
C object
D int64
dtype: object
In [4]: df.fillna(2).dtypes
Out[4]:
A int64
B int64
C object
D int64
dtype: object
In [5]: df.fillna(2)
Out[5]:
A B C D
0 2 2 foo 1
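A quick check of that premise, as a sketch: pandas' own missing-value detection treats NaT as missing, which is why one would expect fillna to act on it. The example series is made up for illustration.

```python
import pandas as pd

# NaT is reported as missing by the scalar check...
scalar_is_na = pd.isna(pd.NaT)

# ...and by the element-wise check on a datetime Series.
s = pd.Series([pd.NaT, pd.Timestamp("2023-01-01")])
mask = s.isna()

print(scalar_is_na)   # True
print(mask.tolist())  # [True, False]
```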
Hello, just to add to this thread. I have encountered this bug when upgrading pandas from 1.2.5 to 1.3.3 (it looks like this bug was introduced in version 1.3.0).
When using fillna or replace on a datetime series, filling with the empty string "" does not work. However, another string such as "hello" does work, and coerces the series to object dtype. Also, interestingly, df.replace({pd.NaT: ""}) behaves differently from df.replace(pd.NaT, "").
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({"A": [pd.NaT]})
In [3]: df.fillna("")
Out[3]:
A
0 NaT
In [4]: df.fillna("hello")
Out[4]:
A
0 hello
In [5]: df.replace(pd.NaT, "")
Out[5]:
A
0 NaT
In [6]: df.replace(pd.NaT, "hello")
Out[6]:
A
0 hello
In [7]: df.replace({pd.NaT, ""})
Out[7]:
A
0 NaT
In [8]: df.replace({pd.NaT, "hello"})
Out[8]:
A
0 NaT
Also reproduced on 1.3.4
Same here on the latest 1.4.2: df.fillna('') doesn't work with NaT (even though pd.isnull() gives True for it).
df.fillna('something') works...
Very surprising that this has been here since 2016?
Same on version 1.4.3: with df = pd.DataFrame({"A": [pd.NaT]}), df.fillna("") does nothing, while df.fillna(" ") replaces NaT with a blank space.
Same here, NaT still shows after filling NA values with the empty string via df.fillna('').
The core issue here appears to be that the Timestamp constructor interprets the empty string as pd.NaT, so the fill value is treated as compatible with the datetime64 dtype and the column is not upcast to object:
In [8]: pd.Timestamp("")
Out[8]: NaT
In [9]: pd.Timestamp(" ")
ValueError: could not convert string to Timestamp
If the behavior of Out[8] were deprecated so that it no longer returned NaT, then this behavior would probably be fixed.
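Until that changes, one possible workaround (a sketch, assuming the goal is a display-ready object column rather than a datetime column) is to cast to object first and substitute through a mask, so the fill value is never parsed as a datetime. The column name "A" mirrors the examples in this thread.

```python
import pandas as pd

df = pd.DataFrame({"A": [pd.NaT, pd.Timestamp("2023-01-01")]})

# Cast to object first so the empty string is never interpreted as a datetime,
# then keep the non-missing values and substitute "" everywhere else.
filled = df["A"].astype(object).where(df["A"].notna(), "")
```

The result is an object column, so datetime accessors such as `.dt` no longer apply to it; that trade-off is inherent in mixing strings with timestamps.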
This might be a temporary workaround 👍
# 1. convert datetime to string
df["target"] = df["target"].dt.strftime('%Y-%m-%d %H:%M:%S')
# 2. fillna
replace_datetime_in_str = "2023-01-01 00:00:00"
df["target"] = df["target"].fillna(replace_datetime_in_str)
# 3. convert string to datetime
df["target"] = pd.to_datetime(df["target"])
I'm a novice, but it seems to still be present in 2.0.1
still present
There is also a bug when replacing with the string "NAN":
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({"A": [pd.NaT]})
In [3]: df.fillna("")
Out[3]:
A
0 NaT
In [4]: df.fillna("hello")
Out[4]:
A
0 hello
In [5]: df.fillna("NAN")
Out[5]:
A
0 NaT
In [6]: df.fillna("NAN_")
Out[6]:
A
0 NAN_