BUG: Future limitation to prevent automatic type cast makes instantiation of empty series impossible
Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd

my_index = pd.Index([1, 2, 3])
ds = pd.Series(None, index=my_index)
ds.iloc[0] = "a"
Issue Description
This raises the following warning: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'a' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
However, the warning is due to the fact that pandas itself inferred a dtype for None. So the library produces an error of its own making.
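For reference, a minimal sketch of the inference step described above. It only inspects what the constructor produced and deliberately skips the incompatible assignment, whose outcome depends on the pandas version (a FutureWarning on the 2.2.0 reported here).

```python
import pandas as pd

my_index = pd.Index([1, 2, 3])
ds = pd.Series(None, index=my_index)

# No dtype was passed, so pandas chooses one itself:
# on the version in this report the result is float64, filled with NaN.
print(ds.dtype)
print(ds.isna().all())
```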
Expected Behavior
Continue to allow assigning records of any type to empty Series. In particular: if None was used at instantiation and the user assigns anything of type str or int, this should be fine.
Installed Versions
INSTALLED VERSIONS
commit : f538741432edf55c6b9fb5d0d496d2dd1d7c2457
python : 3.11.8.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 165 Stepping 5, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_Switzerland.1252
pandas : 2.2.0
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.8.2
setuptools : 68.2.0
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.2
numba : None
numexpr : 2.9.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.12.0
sqlalchemy : None
tables : 3.9.2
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None
cc @MarcoGorelli
Thanks for the report!
Continue to allow assigning records of any type to empty Series.
The Series above has length 3 and so is not empty - unless maybe you're thinking of "a Series with all NA values" as being empty? I would advise not calling this an empty Series - in particular:
print(ds.empty)
# False
But even empty Series have dtypes, and assigning a value into an empty Series in a way that results in an upcast has all the same problems as assigning a value into a non-empty Series, the problems that led to PDEP-6.
In addition, I'd recommend controlling your dtypes when possible: relying on pandas to do inference and guess at what dtype you want can lead to unexpected errors down the road.
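To make the distinction concrete, here is a small sketch contrasting a Series whose values are all NA with one that has no rows at all. The variable names are illustrative only.

```python
import pandas as pd

# Three rows, all missing: not empty, but entirely NA.
all_na = pd.Series(None, index=pd.Index([1, 2, 3]))

# Zero rows: this is what `.empty` actually reports.
no_rows = pd.Series([], dtype="float64")

print(all_na.empty, all_na.isna().all())   # False True
print(no_rows.empty, len(no_rows))         # True 0
```

Both objects carry a dtype, which is why the upcasting question applies to either of them.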
Thanks, rhshadrach! As this was a "discussion expected", let me add a few points where I think I disagree with you, and why I would still call this a highly undesirable (i.e., buggy) behavior:
- The command does NOT assign any data to the Series. While I agree with your observation that ds is not empty, it is only non-empty because pandas has already filled it with NaN during initialization. Thereby, pandas has assigned the dtype=float64 on the fly. However, this is absolutely not intended when the user types None (it's a pretty long shot to argue that None is always of type float). It is clearly a side effect of initialization. Agree?
- So as a consequence of this initialization, a very simple assignment such as =2 or ="a" becomes prohibited.
Where is the problem with this, in my opinion? -> It raises the more fundamental question whether it should - fundamentally - be allowed to initialize a pandas Series without specifying the type. I think it should, but version 2.2.0 effectively prohibits that:
- A type is automatically assigned (= float), even for an empty set (I would still hold that None is empty, at least at the time of input, even if not after initialization). That has been the case for a long time - no issue here. BUT:
- Pandas (now!) forces the user to re-assign a different type before doing anything other than float with this.
I think this fundamentally violates the idea of Python (types do not need to be declared for variables to work) and even of pandas (assignment without dtype is in fact possible).
So the current way of handling this is just inconsistent. On the one hand, you say "Ah, pandas is flexible, so you don't need to define the type. Doing None is just fine, pandas will handle it for you and set it to float." But then later you come back and say "Wait, the type must be defined, automatic casting is prohibited. You wanted something empty? Too bad, now you're stuck with float, with no flexibility to assign anything different."
The proper way to handle this would be, in my opinion:
- either you hold to the argument that the type must be defined, strictly, at time of initialization. Then let's "enforce" that the user declares the type at definition (consequence: pd.Series(None) should then always raise, as the type is not defined)
- or you "allow" the user to declare something empty (then pd.Series(None) should work), but then also allow filling the empty Series with "something" (so a later =2 or ="a" should work)
Here is where version 2.2.0 is breaking the logic. It maintains the idea of having a "flexible" initialization (i.e., does not enforce the former), but then prohibits the (formerly existing) flexibility by raising on the latter.
Now, I'm aware that some might say "I want both" (i.e., both be able to declare without a type and maintain hygiene in terms of type casting) - that's what leads to the type cast restriction. What to do? My proposal would be to in fact do both, but manage the edge case: allow "automatic type casting" only for Series where all records are NaN or None, respectively. That covers the edge case. It would allow pandas to automatically initialize a dtype during construction (which is the status quo, and which you argued for), but it would also maintain the idea of an "empty dataframe/series" that can later be filled with "whatever" (which is currently possible, too, and which I argue should be maintained in the future). So if there is data already, you go ahead and raise if a user wants to assign something different. But in the edge case of having absolutely "nothing" in the Series (and again, assigning None leads to NaN, so that might represent "nothing"), you continue to allow automatic casting.
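The proposed edge-case rule can be approximated in user code today. The sketch below is only an illustration of the idea: `assign_with_all_na_upcast` is a hypothetical helper, not a pandas API, and it takes the simplifying assumption that an all-NA Series is always recast to object before the assignment.

```python
import pandas as pd


def assign_with_all_na_upcast(ser, pos, value):
    """Hypothetical helper: recast to object only if every value is NA."""
    if ser.isna().all():
        # Nothing but NA so far, so no real data can be affected by the cast.
        out = ser.astype(object)
    else:
        # Real data present: keep the dtype and let pandas apply its own rules.
        out = ser.copy()
    out.iloc[pos] = value
    return out


ds = pd.Series(None, index=pd.Index([1, 2, 3]))
filled = assign_with_all_na_upcast(ds, 0, "a")
print(filled.dtype)
```

The helper returns a new Series and leaves the input untouched, so the caller decides whether to rebind the name.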
A final comment: I don't really agree with your "recommendation". It's not so much about relying on pandas to cast properly, but sometimes you may want some flexibility. For example, you might want to incrementally fill a series of int, while you still have gaps in there. So in the end, you may or may not end up with a series that can have dtype int, or maybe must have type Int32 or so.
Of course, you could now say: "Ok, then define it as Int from the start", but that again is highly non-pythonic. It brings us back to the fundamental question: "does pandas require a type definition even at initialization?" I think just getting started with "something" (as Python wonderfully allows us to do without defining types) and only in the end running a "postprocessing" step (-> convert to Int32 if NaN is present, else convert to int, at some point down the road) is neither particularly risky nor the craziest thing in the world either, IMO.
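For comparison, a sketch of the "define it as Int from the start" variant mentioned above, with the postprocessing step at the end. It assumes the nullable Int32 extension dtype is picked up front; the values and names are illustrative.

```python
import pandas as pd

idx = pd.Index([1, 2, 3])

# Assumption: the nullable integer dtype is chosen at construction time,
# so gaps are stored as <NA> and integer assignments need no upcast.
ds = pd.Series([pd.NA] * len(idx), index=idx, dtype="Int32")

ds.iloc[0] = 10
ds.iloc[2] = 30
has_gaps = bool(ds.isna().any())  # one slot is still missing here

ds.iloc[1] = 20

# Postprocessing: drop to plain int32 only once no gaps remain.
final = ds if ds.isna().any() else ds.astype("int32")
print(final.dtype)
```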
The problem really is that pandas does not have a symbol for "Nothing" (unlike Python's None), but instead sets "nothing" to float - and then any argument about "type casting" becomes quite philosophical.
cc @MarcoGorelli
thanks for the ping. This looks expected to be honest - a Series has to have a dtype, and here the user's setting an incompatible element after initialisation
I'd suggest just setting the dtype upon initialisation
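A minimal sketch of that suggestion, assuming the Series is meant to hold mixed Python values so that object is the intended dtype. Any other explicit dtype that matches the planned values would serve the same purpose.

```python
import pandas as pd

my_index = pd.Index([1, 2, 3])

# The dtype is stated explicitly, so pandas does not have to guess float64.
ds = pd.Series(None, index=my_index, dtype=object)

# Later assignments are compatible with object and need no upcast.
ds.iloc[0] = "a"
ds.iloc[1] = 2
print(ds.dtype)
```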
This will lead to people setting ds = pd.Series(None, index=my_index, dtype=object) - which is far worse than allowing anything to overwrite NaN only, no? (by the way not my idea, but the first solution suggested by someone else on stack overflow, even though I agree that object would be the most adequate type for "Nothing" ;) )
that's fine, I'd rather have users intentionally shoot themselves in the foot than have pandas do the shooting for them
that's fine, I'd rather have users intentionally shoot themselves in the foot than have pandas do the shooting for them
+1
Got you... However, agree to disagree. If we were talking C++ or Fortran (where types MUST be declared), I would agree with the logic "users have to shoot themselves". I would say in Python that should not be the case (if the user does not declare a type and sees unexpected behavior, so be it; types are optional) - especially since that is also how it has been with pandas until now. And especially since None does not even have a type in pandas. So it feels like you are really introducing a paradigm shift here. But I see your reasoning, too. Thanks for the discussion.
Thanks for the discussion.
thank you! closing then