
API: distinguish NA vs NaN in floating dtypes

Open jorisvandenbossche opened this issue 4 years ago • 117 comments

Context: in the original pd.NA proposal (https://github.com/pandas-dev/pandas/issues/28095) the topic of pd.NA vs np.nan was raised several times. It also came up in the recent pandas-dev mailing list discussion on pandas 2.0 (both in the context of np.nan for float dtypes and pd.NaT for datetime-like dtypes).

With the introduction of pd.NA, and if we want consistent "NA behaviour" across dtypes at some point in the future, I think there are two options for float dtypes:

  • Keep using np.nan as we do now, but change its behaviour (e.g. in comparison ops) to match pd.NA
  • Start using pd.NA in float dtypes

Personally, I think the first one is not really an option. Keeping it as np.nan, but deviating from numpy's behaviour feels like a non-starter to me. And it would also give a discrepancy between the vectorized behaviour in pandas containers vs the scalar behaviour of np.nan.
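To make that scalar discrepancy concrete: IEEE 754 NaN answers every comparison with False (including with itself), while pd.NA propagates through comparisons. A small illustration:

```python
import numpy as np
import pandas as pd

# IEEE 754 semantics: NaN compares False with everything, itself included
print(np.nan == np.nan)  # False
print(np.nan > 1.0)      # False

# pd.NA propagates instead of answering the comparison
print(pd.NA == 1)        # <NA>
print(pd.NA > 1)         # <NA>
```

Making np.nan inside a pandas container behave like the second pair, while the bare scalar keeps behaving like the first, is exactly the discrepancy described above.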
For the second option, there are still multiple ways this could be implemented (a single array that still uses np.nan as the missing value sentinel but is presented to the user as pd.NA, versus a masked approach like we use for the nullable integers). But in this issue, I would like to focus on the user-facing behaviour we want: Do we want to have both np.nan and pd.NA, or only allow pd.NA? Should np.nan still be considered as "missing", or should that be optional? What to do on conversion from/to numpy? (The answer to some of those questions will also determine which of the two possible implementations is preferable.)


Actual discussion items: assume we are going to add floating dtypes that use pd.NA as missing value indicator. Then the following question comes up:

If I have a Series[float64] could it contain both np.nan and pd.NA, and these signify different things?

So yes, it is technically possible to have both np.nan and pd.NA with different behaviour (np.nan as "normal", unmasked value in the actual data, pd.NA tracked in the mask). But we also need to decide if we want this.
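A minimal sketch of what "both at once" means under a masked representation (illustrative only, not pandas internals): np.nan lives in the data buffer as an ordinary float, while pd.NA is tracked in a separate boolean mask.

```python
import numpy as np

# Illustrative sketch, not pandas internals: NaN is an ordinary value in
# the data buffer, NA is encoded only by the mask.
values = np.array([np.nan, 1.0, 2.0])   # position 0 holds a "real" NaN
mask = np.array([False, False, True])   # position 2 is NA (masked)

is_nan = np.isnan(values) & ~mask       # NaN-but-not-NA positions
is_na = mask.copy()                     # NA positions
print(is_nan)  # [ True False False]
print(is_na)   # [False False  True]
```

Because the two markers live in different places, the array can keep them apart, and the question becomes purely one of user-facing semantics.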

This was touched upon a bit in the original issue, but not really discussed further. Quoting a few things from the original thread in https://github.com/pandas-dev/pandas/issues/28095:

[@Dr-Irv in comment] I think it is important to distinguish between NA meaning "data missing" versus NaN meaning "not a number" / "bad computational result".

vs

[@datapythonista in comment] I think NaN and NaT, when present, should be copied to the mask, and then we can forget about them (for what I understand values with True in the NA mask won't be ever used).

So I think those two nicely describe the two options we have on the question "do we want both pd.NA and np.nan in a float dtype, signifying different things?" -> 1) Yes, we can have both, versus 2) No, towards the user we only have pd.NA and "disallow" NaN (or interpret / convert any NaN on input to NA).

A reason to have both is that they can signify different things (another reason is that most other data tools do this as well, I will put some comparisons in a separate post). That reasoning was given by @Dr-Irv in https://github.com/pandas-dev/pandas/issues/28095#issuecomment-538786581: there are times when I get NaN as a result of a computation, which indicates that I did something numerically wrong, versus NaN meaning "missing data". So should there be separate markers - one to mean "missing value" and the other to mean "bad computational result" (typically 0/0) ?

A dummy example showing how both can occur:

>>> pd.Series([0, 1, 2]) / pd.Series([0, 1, pd.NA])
0     NaN
1     1.0
2    <NA>
dtype: float64

The NaN is introduced by the computation, the NA is propagated from the input data (although note that in an arithmetic operation like this, NaN would also propagate).
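At the scalar level, both markers do propagate through arithmetic, which is why the distinction between them only surfaces in comparisons and in the missing-data methods:

```python
import numpy as np
import pandas as pd

# Both markers propagate through arithmetic; they only diverge in
# comparisons and in missing-data handling.
print(np.nan + 1)   # nan
print(pd.NA + 1)    # <NA>
print(np.nan * 0)   # nan
print(pd.NA * 0)    # <NA>
```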

So, yes, it is possible and potentially desirable to allow both pd.NA and np.nan in floating dtypes. But, it also brings up several questions / complexities. Foremost, should NaN still be considered as missing? Meaning, should it be seen as missing in functions like isna/notna/dropna/fillna ? Or should that be an option? Should NaN still be considered as missing (and thus skipped) in reducing operations (that have a skipna keyword, like sum, mean, etc)?

Personally, I think we will need to keep treating NaN as missing, at least initially. But that will also introduce inconsistencies: although NaN would be seen as missing in the methods mentioned above, in arithmetic / comparison / scalar ops it would behave as NaN and not as NA (so e.g. comparison gives False instead of propagating). It also means that in the missing-related methods, we will need to check both for NaN in the values and for the mask (which can also have performance implications).
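This split already exists today for the numpy-backed float64 dtype: NaN counts as missing in the missing-data methods and reductions, yet follows IEEE semantics in comparisons:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan])     # default numpy-backed float64

print(s.isna().tolist())         # [False, True]: NaN counts as missing
print((s == np.nan).tolist())    # [False, False]: comparisons follow IEEE NaN
print(s.sum())                   # 1.0: skipna=True skips NaN
print(s.dropna().tolist())       # [1.0]
```

A pd.NA-based float dtype that keeps NaN as missing would carry the same split, just with two markers instead of one.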


Some other various considerations:

  • Having both pd.NA and NaN (np.nan) might actually be more confusing for users.

  • If we want a consistent indicator and behavior for missing values across dtypes, I think we need a separate concept from NaN for float dtypes (i.e. pd.NA). Changing the behavior of NaN when inside a pandas container seems like a non-starter (the behavior of NaN is well defined in IEEE 754, and it would also deviate from the underlying numpy array)

  • How do we handle compatibility with numpy? The solution that we have come up with (for now) for the other nullable dtypes is to convert to object dtype by default, and to have an explicit to_numpy(.., na_value=np.nan) conversion. But given how np.nan is in practice used throughout the pydata ecosystem as a missing value indicator, this might be annoying.

    For conversion to numpy, see also some relevant discussion in https://github.com/pandas-dev/pandas/issues/30038

  • What about conversion / inference on input? E.g. when creating a Series from a float numpy array with NaNs (pd.Series(np.array([0.1, np.nan]))), do we convert NaNs to NA automatically by default?
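The existing nullable integer dtypes already show what the numpy-conversion pattern mentioned above looks like in practice; a nullable float dtype would presumably follow the same API:

```python
import numpy as np
import pandas as pd

# The nullable Int64 dtype illustrates the conversion behaviour described
# above; a nullable float dtype would presumably mirror it.
arr = pd.array([1, pd.NA], dtype="Int64")

print(arr.to_numpy())                                  # object array: [1, <NA>]
print(arr.to_numpy(dtype="float64", na_value=np.nan))  # float64 array: [ 1. nan]
```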

cc @pandas-dev/pandas-core @Dr-Irv @dsaxton

jorisvandenbossche avatar Feb 26 '20 10:02 jorisvandenbossche