BUG: Future limitation to prevent automatic type cast makes instantiation of empty series impossible

Open KingOtto123 opened this issue 1 year ago • 8 comments

Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

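import pandas as pd

my_index = pd.Index([1, 2, 3])
ds = pd.Series(None, index=my_index)  # no dtype given; pandas picks one itself
ds.iloc[0] = "a"  # triggers the FutureWarning quoted below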

Issue Description

This emits the following warning: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'a' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.

However, the warning is due to the fact that pandas itself inferred a dtype for None. So the library produces an error of its own making.
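
The inference described above can be checked directly. This is a minimal sketch, assuming pandas 2.2 as in this report; `ds` and `my_index` are the names from the reproducible example.

```python
import pandas as pd

my_index = pd.Index([1, 2, 3])
ds = pd.Series(None, index=my_index)

# No dtype was requested, yet the Series already has one.
print(ds.dtype)         # float64, matching the dtype named in the warning
print(ds.isna().all())  # every entry is a NaN placeholder
```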

Expected Behavior

Continue to allow assigning records of any type to an empty Series. In particular: if the Series was instantiated with None and the user assigns anything of type str or int, this should be fine.

Installed Versions

INSTALLED VERSIONS

commit : f538741432edf55c6b9fb5d0d496d2dd1d7c2457
python : 3.11.8.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 165 Stepping 5, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_Switzerland.1252
pandas : 2.2.0
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.8.2
setuptools : 68.2.0
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.2
numba : None
numexpr : 2.9.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.12.0
sqlalchemy : None
tables : 3.9.2
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

KingOtto123 • Feb 15 '24 18:02

cc @MarcoGorelli

mroeschke • Feb 15 '24 19:02

Thanks for the report!

> Continue to allow assigning records of any type to empty Series.

The Series above has length 3 and so is not empty - unless maybe you're thinking of "a Series with all NA values" as being empty? I would advise not calling this an empty Series - in particular:

print(ds.empty)
# False

But even empty Series have dtypes, and assigning a value into an empty Series in a way that results in an upcast has all the same problems as doing so with a non-empty Series, which is what led to PDEP-6.

In addition, I'd recommend controlling your dtypes when possible: relying on pandas to do inference and guess at what dtype you want can lead to unexpected errors down the road.
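
The two points above (an all-NA Series is not empty, and a truly empty Series still carries a dtype) can be illustrated with a small sketch. The variable names `all_na` and `truly_empty` are made up for this example.

```python
import pandas as pd

all_na = pd.Series(None, index=pd.Index([1, 2, 3]))
truly_empty = pd.Series([], dtype="float64")

# Three NaN entries: the Series has length 3, so it is not empty.
print(all_na.empty, len(all_na))

# Zero entries, but a dtype is attached all the same.
print(truly_empty.empty, truly_empty.dtype)
```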

rhshadrach • Feb 15 '24 22:02

Thanks, rhshadrach! As this was a "discussion expected" issue, let me add a few points where I think I disagree with you, and why I would still call this highly undesirable (i.e., buggy) behavior:

  • The command does NOT assign any data to the Series. While I agree with your observation that ds is not empty, it is only non-empty after initialization because pandas has already filled it with NaN during initialization. Thereby, pandas has assigned dtype=float64 on the fly. However, this is absolutely not intended when the user types None (it's a pretty long shot to argue that None is always of type float). It is clearly a side effect of initialization. Agree?
  • So as a consequence of this initialization, a very simple assignment such as =2 or ="a" becomes prohibited.

Where is the problem with this, in my opinion? -> It raises the more fundamental question whether it should - fundamentally - be allowed to initialize a pandas Series without specifying the type. I think it should, but version 2.2.0 effectively prohibits that:

  • A type is automatically assigned (=float), even for an empty set (I would still maintain that None is empty, at least at the time of input, even if not after initialization). That has been the case for a long time, no issue here. BUT:
  • Pandas (now!) forces the user to re-assign a different type before doing anything other than float with this. I think this fundamentally violates the idea of Python (types do not need to be declared for variables to work) and even pandas (assignment without dtype is in fact possible).

So the current way of handling this is just inconsistent. On the one hand, you say "Ah, pandas is flexible, so you don't need to define the type. Doing None is just fine, pandas will handle it for you and set it to float." But then later you come back and say "Wait, type must be defined, automatic cast is prohibited. You wanted something empty? Too bad, now you're stuck with float, no flexibility in assigning anything different now".

The proper way to handle this would be, in my opinion:

  • either you hold up the argument that the type must be defined, strictly, at time of initialization. Then let's "enforce" that the user declares the type at definition (consequence: then pd.Series(None) should raise, always, as the type is not defined)
  • or, "allow" the user to declare something empty (then pd.Series(None) should work) but then also allow filling the empty Series with "something" (so later =2 or ="a" should work)

Here is where version 2.2.0 is breaking the logic. It maintains the idea of having a "flexible" initialization (i.e., does not enforce the former), but then prohibits the (formerly existing) flexibility of the latter by raising.

Now, I'm aware that some might say "I want both" (i.e., be able to declare without a type, but maintain hygiene in terms of type casting), and that is what leads to the type cast restriction. What to do? My proposal would be to in fact do both, but manage the edge case: allow "automatic type casting" only for Series where all records are NaN or None, respectively. That covers the edge case. It would allow pandas to automatically assign a dtype during construction (which is the status quo, and which you argued for), but it would also maintain the idea of an "empty DataFrame/Series" that can later be filled with "whatever" (which is currently possible, too, and which I argue should be maintained in the future). So if there is data already, you go ahead and raise if a user wants to assign something different. But in the edge case of having absolutely "nothing" in the Series (and again, assigning None leads to NaN, so that might represent "nothing"), you continue to allow automatic casting.
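
The proposed edge-case rule can be approximated in user code today. The sketch below is only an illustration of the idea, not pandas behavior: `set_allowing_upcast_if_all_na` is a hypothetical helper, and it casts to object, which is one possible reading of "automatic casting" rather than what pandas would necessarily choose.

```python
import pandas as pd


def set_allowing_upcast_if_all_na(series: pd.Series, position: int, value) -> pd.Series:
    """Hypothetical helper, not pandas API: return a copy with `value` at `position`.

    If every entry is missing, the copy is first cast to object so that a
    value of any type can be stored. Otherwise the normal setitem rules apply.
    """
    result = series.copy()
    if result.isna().all():
        result = result.astype(object)
    result.iloc[position] = value
    return result


ds = pd.Series(None, index=pd.Index([1, 2, 3]))  # all NaN
filled = set_allowing_upcast_if_all_na(ds, 0, "a")
print(filled.dtype)    # object
print(filled.iloc[0])  # a
```

Note that once the first value is stored the result has object dtype, so later assignments of any type are accepted too; a stricter variant would need to pick a narrower dtype from the first value.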

KingOtto123 • Feb 16 '24 07:02

A final comment: I don't really agree with your "recommendation". It's not so much about relying on pandas to cast properly, but sometimes you may want some flexibility. For example, you might want to incrementally fill a series of int, while you still have gaps in there. So in the end, you may or may not end up with a series that can have dtype int, or maybe must have type Int32 or so.

Of course, you could now say: "Ok, then define it as Int from the start", but that again is highly non-pythonic. It brings us back to the fundamental thing: "does pandas require a type definition even at initialization"? I think just getting started with "something" (as python wonderfully allows us to do without defining types) and only in the end running a "postprocessing" (-> convert to Int32 if NaN is present, else convert to int, some point down the road) is neither particularly risky nor the craziest thing in the world either, IMO.
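
For reference, the "define it as Int from the start" route mentioned above could look like the following sketch. It uses the nullable `Int64` dtype instead of the `Int32` named in the comment, and `counts` and `final` are illustrative names.

```python
import pandas as pd

idx = pd.Index([1, 2, 3])

# Nullable integer dtype: missing entries are pd.NA rather than float NaN.
counts = pd.Series(pd.NA, index=idx, dtype="Int64")
counts.iloc[0] = 10
counts.iloc[2] = 30  # position 1 is still missing

# Optional post-processing: switch to plain int64 only when no gaps remain.
final = counts.astype("int64") if not counts.isna().any() else counts
print(final.dtype)  # Int64, because one entry is still missing
```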

The problem really is that pandas does not have a symbol for "Nothing" (unlike python's None), but instead sets "nothing" to float - and then any argument about "type casting" becomes quite philosophical.

KingOtto123 • Feb 16 '24 07:02

> cc @MarcoGorelli

thanks for the ping. This looks expected to be honest - a Series has to have a dtype, and here the user's setting an incompatible element after initialisation

I'd suggest just setting the dtype upon initialisation

MarcoGorelli • Feb 16 '24 16:02

This will lead to people setting ds = pd.Series(None, index=my_index, dtype=object), which is far worse than allowing anything to overwrite NaN only, no? (By the way, not my idea, but the first solution suggested by someone else on Stack Overflow, even though I agree that object would be the most adequate type for "Nothing" ;) )
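
Spelled out, the object-dtype workaround mentioned above is the reproducible example with one extra argument. This is a minimal sketch, reusing the names `ds` and `my_index` from the original report.

```python
import pandas as pd

my_index = pd.Index([1, 2, 3])
ds = pd.Series(None, index=my_index, dtype=object)

ds.iloc[0] = "a"  # no upcast needed: object can hold any Python value
ds.iloc[1] = 2
print(ds.dtype)   # object
```

The trade-off is that an object Series stores arbitrary Python objects, so it gives up the type guarantees of a specific dtype.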

KingOtto123 • Feb 16 '24 17:02

that's fine, I'd rather have users intentionally shoot themselves in the foot than have pandas do the shooting for them

MarcoGorelli • Feb 16 '24 17:02

> that's fine, I'd rather have users intentionally shoot themselves in the foot than have pandas do the shooting for them

+1

rhshadrach • Feb 16 '24 21:02

Got you... However, agree to disagree. If we were talking C++ or Fortran (where types MUST be declared), I would agree with the logic "users have to shoot themselves". I would say that in Python this should not be the case (if the user does not declare a type and sees unexpected behavior, so be it; types are optional), especially since that is also how it has been with pandas until now. And especially since None does not even have a type in pandas. So it feels like you are really introducing a paradigm shift here. But I see your reasoning, too. Thanks for the discussion.

KingOtto123 • Feb 18 '24 18:02

> Thanks for the discussion.

thank you! closing then

MarcoGorelli • Feb 18 '24 20:02