pandas
BUG: Series.astype is unable to handle NaN
Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [2.0, np.nan, 1.0]})
df = df.astype(object)

def iif(a, b, c):
    if a:
        return b
    else:
        return c

df["a"].astype(object).apply(lambda r: iif(np.isnan(r), None, r)).astype(object).apply(lambda r: iif(np.isnan(r), None, r))
df
```
Issue Description
The result is:
```
0    2.0
1    NaN
2    1.0
```
Expected Behavior
The result should be:
```
0    2.0
1    None
2    1.0
```
I checked "BUG: Replacing NaN with None in Pandas 1.3 does not work", but `astype` does not seem to work here either.
Installed Versions
```
INSTALLED VERSIONS
------------------
commit           : 06d230151e6f18fdb8139d09abf539867a8cd481
python           : 3.8.10.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.17763
machine          : AMD64
processor        : Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : English_United States.950

pandas           : 1.4.1
numpy            : 1.22.3
pytz             : 2021.3
dateutil         : 2.8.2
pip              : 21.1.1
setuptools       : 56.0.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : 8.1.1
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : 3.5.1
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : 1.8.0
sqlalchemy       : 1.4.32
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None
```
However,

```python
df.where(pd.notnull(df), None)
```

gives

```
0    2.0
1    None
2    1.0
```
Thanks for the report! I wonder if apply is coercing the result into a numeric dtype. I would agree this is not expected, further investigations and PRs to fix are welcome!
take
Hi, @ingted. I have been looking over your example, and I think the issue may not be with `.astype()`. The reason this example produces a Series whose second value is `NaN` rather than `None` is that, by default, as @rhshadrach guessed, `.apply()` has `convert_dtype=True`. This coerces the result into whatever dtype it deems best after the provided function has been applied to each element. If you instead change your example to the following, with `convert_dtype=False`, your desired output is achieved:
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [2.0, np.nan, 1.0]})
df = df.astype(object)

def iif(a, b, c):
    if a:
        return b
    else:
        return c

df["a"].apply(lambda r: iif(np.isnan(r), None, r), convert_dtype=False)
df
```
Including the (then redundant) occurrences of `.astype()` also works:
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [2.0, np.nan, 1.0]})
df = df.astype(object)

def iif(a, b, c):
    if a:
        return b
    else:
        return c

df["a"].astype(object).apply(lambda r: iif(np.isnan(r), None, r), convert_dtype=False).astype(object)
df
```
Let me know if I am misunderstanding the issue and there really is a problem with `.astype()`.
That being said, it is interesting that `.apply(lambda r: iif(np.isnan(r), None, r), convert_dtype=True)` chooses to convert the Series to float64 instead of keeping it as object. Will look into this more.
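The coercion can be reproduced without `iif` at all. A minimal sketch (the variable names are my own) showing that `.apply()`'s default dtype inference folds the returned `None` back into a float64 Series on the pandas versions discussed in this thread:

```python
import pandas as pd
import numpy as np

s = pd.Series([2.0, np.nan, 1.0], dtype=object)

# NaN is the only float that is not equal to itself, so `r != r` detects it.
out = s.apply(lambda r: None if r != r else r)

# Despite returning None, the inferred result dtype is float64,
# and the None has become NaN again.
print(out.dtype)
print(out.iloc[1])
```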
@rhshadrach Given this information, would you agree that this behavior of `.apply()` with `convert_dtype=True` is unexpected?
`.apply()` in this case calls `map_infer()`, which, after applying the function to all elements, calls `maybe_convert_objects()` if `convert_dtype=True`. That function decides to convert the Series to float64 because it saw that the Series consisted of floats and a `None`. From the code, it actually does look intentional. The specific control flow I am talking about is (in `pandas/_libs/lib.pyx`):
```cython
floats = cnp.PyArray_EMPTY(1, objects.shape, cnp.NPY_FLOAT64, 0)
...
for i in range(n):
    val = objects[i]
    if itemsize_max != -1:
        itemsize = get_itemsize(val)
        if itemsize > itemsize_max or itemsize == -1:
            itemsize_max = itemsize

    if val is None:
        seen.null_ = True
        floats[i] = complexes[i] = fnan
        mask[i] = True
    elif val is NaT:
        seen.nat_ = True
        if convert_datetime:
            idatetimes[i] = NPY_NAT
        if convert_timedelta:
            itimedeltas[i] = NPY_NAT
        if not (convert_datetime or convert_timedelta or convert_period):
            seen.object_ = True
            break
    elif val is np.nan:
        seen.nan_ = True
        mask[i] = True
        floats[i] = complexes[i] = val
    elif util.is_bool_object(val):
        seen.bool_ = True
        bools[i] = val
    elif util.is_float_object(val):
        floats[i] = complexes[i] = val
        seen.float_ = True
    ...

if not seen.object_:
    result = None
    if not safe:
        if seen.null_ or seen.nan_:
            if seen.is_float_or_complex:
                if seen.complex_:
                    result = complexes
                elif seen.float_:
                    result = floats
...

if result is uints or result is ints or result is floats or result is complexes:
    # cast to the largest itemsize when all values are NumPy scalars
    if itemsize_max > 0 and itemsize_max != result.dtype.itemsize:
        result = result.astype(result.dtype.kind + str(itemsize_max))
    return result
elif result is not None:
    return result
```
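For readers who do not want to trace the Cython, here is a rough pure-Python sketch (my own simplification, not the real implementation) of the decision that matters for this issue: `None` and float `NaN` are both written into the same float64 buffer, so a column of floats plus `None` comes back as float64.

```python
import numpy as np

def sketch_convert(objects):
    # Simplified stand-in for maybe_convert_objects: track what has been
    # seen while filling a float64 buffer as we go.
    floats = np.empty(len(objects), dtype=np.float64)
    seen_float = seen_null = seen_object = False
    for i, val in enumerate(objects):
        if val is None:
            seen_null = True
            floats[i] = np.nan       # None lands in the float buffer as NaN
        elif isinstance(val, float):
            seen_float = True
            floats[i] = val
        else:
            seen_object = True       # anything else falls back to object
            break
    if not seen_object and seen_float:
        return floats                # float64 wins; None silently became NaN
    return np.array(objects, dtype=object)

print(sketch_convert([2.0, None, 1.0]).dtype)  # float64
print(sketch_convert([2.0, None, "x"]).dtype)  # object
```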
Not sure why `convert_dtype=False` should be required if the user doesn't know about it... At least the equivalent operation on a DataFrame doesn't need `convert_dtype=False` to be specified...
> @rhshadrach Given this information, would you agree that this behavior of `.apply()` with `convert_dtype=True` is unexpected?
In the current state, there are various places where we treat `None` as `np.nan`, and this behavior is consistent with that.
```python
df = pd.DataFrame({'a': [1, 2, None]})
print(df)
#      a
# 0  1.0
# 1  2.0
# 2  NaN

df = pd.DataFrame({'a': [1, np.nan, None], 'b': [2, 3, 4]})
gb = df.groupby('a', dropna=False)
print(gb.sum())
#      b
# a
# 1.0  2
# NaN  7
```
That said, I have a hunch that it would be better if pandas were to treat `None` as a Python object instead of as `np.nan`. However, this issue needs more study before I would feel certain of that.
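As a point of comparison (my own example, not from the thread), an explicit object dtype already preserves `None` at construction time, while default inference folds it into NaN:

```python
import pandas as pd
import numpy as np

# Explicit object dtype keeps None as a genuine Python object.
s = pd.Series([2.0, None, 1.0], dtype=object)
assert s.iloc[1] is None

# Default inference coerces the same data to float64, turning None into NaN.
t = pd.Series([2.0, None, 1.0])
assert t.dtype == np.float64 and np.isnan(t.iloc[1])
```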