
BUG: Series.astype is unable to handle NaN

Open ingted opened this issue 2 years ago • 7 comments

Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np
df = pd.DataFrame({"a":[2.0, np.nan, 1.0]})
df = df.astype(object)
def iif(a, b, c):
    if a:
        return b
    else:
        return c

df["a"].astype(object).apply(lambda r: iif(np.isnan(r),None,r)).astype(object).apply(lambda r: iif(np.isnan(r),None,r))
df

Issue Description

The result is:

0    2.0
1    NaN
2    1.0

Expected Behavior

The result should be:

0    2.0
1    None
2    1.0

I checked "BUG: Replacing NaN with None in Pandas 1.3 does not work", but it seems astype doesn't help here either.
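For reference, .astype(object) on a float column does not turn NaN into None; it just boxes the float nan as a Python object. A minimal check, independent of the apply chain:

```python
import math

import numpy as np
import pandas as pd

# astype(object) keeps the float NaN; it does not substitute None
s = pd.Series([2.0, np.nan, 1.0]).astype(object)
val = s[1]

print(type(val))        # <class 'float'>, not NoneType
print(math.isnan(val))  # True
```

So the conversion to None has to come from somewhere else in the chain, not from astype itself.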

Installed Versions

INSTALLED VERSIONS

commit : 06d230151e6f18fdb8139d09abf539867a8cd481
python : 3.8.10.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.17763
machine : AMD64
processor : Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.950

pandas : 1.4.1
numpy : 1.22.3
pytz : 2021.3
dateutil : 2.8.2
pip : 21.1.1
setuptools : 56.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.1.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.5.1
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.0
sqlalchemy : 1.4.32
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

ingted avatar Mar 15 '22 22:03 ingted

However,

df.where(pd.notnull(df), None)

could lead to

0    2.0
1    None
2    1.0

ingted avatar Mar 15 '22 22:03 ingted

Thanks for the report! I wonder if apply is coercing the result into a numeric dtype. I would agree this is not expected, further investigations and PRs to fix are welcome!

rhshadrach avatar Jun 28 '22 20:06 rhshadrach

take

kapiliyer avatar Jul 18 '22 19:07 kapiliyer

Hi @ingted, I have been looking over your example, and I think the issue may not be with .astype(). The reason this example produces a Series whose second value is NaN rather than None is that, by default, as @rhshadrach guessed, .apply() runs with convert_dtype=True. This coerces the result of applying the function to whatever dtype pandas deems best. If you change your example to the following, passing convert_dtype=False, your desired output is achieved:

import pandas as pd
import numpy as np
df = pd.DataFrame({"a":[2.0, np.nan, 1.0]})
df = df.astype(object)
def iif(a, b, c):
    if a:
        return b
    else:
        return c

df["a"].apply(lambda r: iif(np.isnan(r),None,r), convert_dtype=False)
df

Including the now-redundant occurrences of .astype() also works:

import pandas as pd
import numpy as np
df = pd.DataFrame({"a":[2.0, np.nan, 1.0]})
df = df.astype(object)
def iif(a, b, c):
    if a:
        return b
    else:
        return c

df["a"].astype(object).apply(lambda r: iif(np.isnan(r),None,r), convert_dtype=False).astype(object)
df

Let me know if I am misunderstanding the issue and there is actually a problem with .astype().

That being said, it is interesting that .apply(lambda r: iif(np.isnan(r), None, r), convert_dtype=True) converts the Series to float64 instead of keeping it as object. Will look into this more.
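The coercion can be seen without the iif helper at all; an identity function is enough. A minimal sketch (assuming the default convert_dtype behavior of pandas 1.x/2.x):

```python
import numpy as np
import pandas as pd

# object-dtype Series holding floats and a real Python None
s = pd.Series([2.0, None, 1.0], dtype=object)

# the identity function changes nothing element-wise, yet the default
# dtype inference in .apply() coerces the result to float64,
# and the None is folded into NaN along the way
out = s.apply(lambda x: x)

print(out.dtype)  # float64
print(out[1])     # nan
```

In other words, the None produced by the applied function never survives the post-apply inference step.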

@rhshadrach, given this information, would you agree that this behavior of .apply() with convert_dtype=True is unexpected?

kapiliyer avatar Jul 25 '22 03:07 kapiliyer

.apply() in this case calls map_infer(), which applies the function to all elements and then, if convert_dtype=True, calls maybe_convert_objects(). That function decides to convert the Series to float64 because it saw that the Series consisted of floats and a None. From the code, this actually looks intentional. The specific control flow is (in pandas/_libs/lib.pyx):

    floats = cnp.PyArray_EMPTY(1, objects.shape, cnp.NPY_FLOAT64, 0)

...

    for i in range(n):
        val = objects[i]
        if itemsize_max != -1:
            itemsize = get_itemsize(val)
            if itemsize > itemsize_max or itemsize == -1:
                itemsize_max = itemsize

        if val is None:
            seen.null_ = True
            floats[i] = complexes[i] = fnan
            mask[i] = True
        elif val is NaT:
            seen.nat_ = True
            if convert_datetime:
                idatetimes[i] = NPY_NAT
            if convert_timedelta:
                itimedeltas[i] = NPY_NAT
            if not (convert_datetime or convert_timedelta or convert_period):
                seen.object_ = True
                break
        elif val is np.nan:
            seen.nan_ = True
            mask[i] = True
            floats[i] = complexes[i] = val
        elif util.is_bool_object(val):
            seen.bool_ = True
            bools[i] = val
        elif util.is_float_object(val):
            floats[i] = complexes[i] = val
            seen.float_ = True

...

    if not seen.object_:
        result = None
        if not safe:
            if seen.null_ or seen.nan_:
                if seen.is_float_or_complex:
                    if seen.complex_:
                        result = complexes
                    elif seen.float_:
                        result = floats

...

        if result is uints or result is ints or result is floats or result is complexes:
            # cast to the largest itemsize when all values are NumPy scalars
            if itemsize_max > 0 and itemsize_max != result.dtype.itemsize:
                result = result.astype(result.dtype.kind + str(itemsize_max))
            return result
        elif result is not None:
            return result
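The decision above can be modeled in plain Python. This is a simplified sketch of the float path only, not the actual Cython implementation (which also tracks ints, bools, datetimes, and more):

```python
import numpy as np

def sketch_convert(objects):
    """Simplified model of maybe_convert_objects' float path:
    if every element is a float, None, or NaN, return a float64
    array (None and NaN both become np.nan); otherwise keep the
    values as an object array."""
    floats = np.empty(len(objects), dtype=np.float64)
    seen_object = False
    for i, val in enumerate(objects):
        if val is None or (isinstance(val, float) and np.isnan(val)):
            floats[i] = np.nan   # null sentinels folded into NaN
        elif isinstance(val, float):
            floats[i] = val
        else:
            seen_object = True   # anything else forces object dtype
            break
    return np.array(objects, dtype=object) if seen_object else floats

print(sketch_convert([2.0, None, 1.0]).dtype)  # float64: None was folded into NaN
print(sketch_convert([2.0, "x", 1.0]).dtype)   # object
```

This mirrors why the reporter's Series comes back as float64: nothing in it ever sets seen.object_, so the floats buffer (with NaN in place of None) is returned.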

kapiliyer avatar Jul 25 '22 03:07 kapiliyer

Not sure why convert_dtype=False should be required if the user doesn't know about it...

At least the similar operation on a DataFrame doesn't need convert_dtype=False to be specified...

ingted avatar Jul 25 '22 04:07 ingted

@rhshadrach, given this information, would you agree that this behavior of .apply() with convert_dtype=True is unexpected?

In its current state, pandas treats None as np.nan in various places, and this behavior is consistent with that. For example:

df = pd.DataFrame({'a': [1, 2, None]})
print(df)

#      a
# 0  1.0
# 1  2.0
# 2  NaN

df = pd.DataFrame({'a': [1, np.nan, None], 'b': [2, 3, 4]})
gb = df.groupby('a', dropna=False)
print(gb.sum())

#      b
# a     
# 1.0  2
# NaN  7

That said, I have a hunch that it would be better if pandas treated None as a Python object instead of np.nan. However, this issue needs more study before I feel certain of that.
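For comparison, None does survive today when the dtype is explicitly object from the start, since the constructor then skips inference. A small sketch:

```python
import pandas as pd

# with an explicit object dtype, no inference is performed and the
# None is stored as a real Python None
s = pd.Series([1, 2, None], dtype=object)
print(s[2] is None)  # True

# without the explicit dtype, inference folds None into NaN (float64)
s2 = pd.Series([1, 2, None])
print(s2.dtype)      # float64
```

So the None-as-np.nan treatment is specifically a property of the inference paths, not of object-dtype storage itself.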

rhshadrach avatar Aug 07 '22 12:08 rhshadrach