pandas
FIX: REGR: setting numeric value in Categorical Series with enlargement raises internal error
- [x] closes #47677
- [x] Tests added and passed if fixing a bug or adding a new feature
- [x] All code checks passed.
- [x] Added type annotations to new arguments/methods/functions.
- [x] Added an entry in the latest doc/source/whatsnew/v1.5.0.rst file if fixing a bug or adding a new feature.
What happens when a string is used for enlarging? Like a
The series is also cast to object, which might not be the behaviour we want. But this also happened prior to the regression (tested on v1.4.3).
Yeah this is also a bug then. Could you check if the categorical dtypes match and then return the dtype accordingly?
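For reference, `CategoricalDtype` equality already distinguishes differing categories, which is presumably what such a check would build on; a quick illustration:

```python
import pandas as pd

a = pd.CategoricalDtype(categories=["a", "b", "c"])
b = pd.CategoricalDtype(categories=["a", "b"])

# Dtypes with different categories do not compare equal...
print(a == b)  # False
# ...while identical categories (and ordered flag) do.
print(a == pd.CategoricalDtype(categories=["a", "b", "c"]))  # True
# Any CategoricalDtype also compares equal to the string "category".
print(a == "category")  # True
```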
How should we tackle that problem? If the enlarging value can be cast to categorical we should end up with a categorical series (so enlarging with 0 as in the issue also yields a categorical series)? Or only if the enlarging value is already one of the categories?
How does concat behave in that case?
Like this:

```python
import pandas as pd

s = pd.Series(["a", "b", "c"], dtype="category")
t = pd.concat([s, pd.Series(["a"], index=[3])])  # dtype: object
t2 = pd.concat([s, pd.Series(["a"], index=[3], dtype="category")])  # dtype: object
```
I would consider this fine for `t` but not for `t2`. Note that I checked this on v1.4.3 and with my fix, and the behaviour is the same for both.
Yeah, I think `t2` is off.
Could you try `t2` with the categories explicitly specified as `a`, `b`, `c`?
This is interesting:

```python
t = pd.concat([s, pd.Series(["a"])])  # dtype: object
t2 = pd.concat([s, pd.Series(["a"], dtype="category")])  # dtype: object
t3 = pd.concat([s, pd.Series(["a", "b"], dtype="category")])  # dtype: object
t4 = pd.concat([s, pd.Series(["a", "b", "c"], dtype="category")])  # dtype: category
t5 = pd.concat([s, pd.Series(["a", "b", "a", "b", "c"], dtype="category")])  # dtype: category
t6 = pd.concat([s, pd.Series(["a", "b", "a", "b", "a"], dtype="category")])  # dtype: object
t7 = pd.concat([s, pd.Series(["a", "b", "d"], dtype="category")])  # dtype: object
```
Only if the categories match exactly do we preserve the categorical dtype. This looks really buggy, but it is not a regression: I checked v1.4.3, main, and my fix, and the behaviour is identical on all of them. Should we create a separate issue for this?
No, this makes sense: the dtypes are not equal when the categories differ. But enlargement with a scalar is a different case; we should preserve the categorical dtype there. We have to create a series with the correct categories in that case.
I updated the PR to preserve the categorical dtype if the enlarging element is already in the categories. I needed a special case in `_setitem_with_indexer_missing` for this, since we don't want to change `concat_compat`.
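If I read the new behaviour correctly, a minimal illustration of the rule would look like this (assuming a pandas version that includes this fix):

```python
import pandas as pd

ser = pd.Series(["a", "b", "c"], dtype="category")
ser.loc[3] = "a"  # "a" is an existing category, so the categorical dtype is kept

other = pd.Series(["a", "b", "c"], dtype="category")
other.loc[3] = "d"  # "d" is not a category, so we fall back to object
```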
Please note that in

```python
s = pd.DataFrame([["a", "b"], ["c", "k"]], dtype="category")
s.loc[3] = "a"
```

both columns are cast to object. This behaviour was not changed. But to be consistent, I would expect the first column to stay categorical while the second changes.
DataFrame cases don't work yet, this is correct. But this still has multiple issues, so let's focus on Series here.
Could you also add a test where we enlarge with nan? This is not part of the categories but should probably work? Not sure.
Yep, it works in the sense that after enlarging with nan the series has dtype object. But I think that's fine. I added the test.
I was referring to the opposite:

```python
import numpy as np
import pandas as pd

result = pd.Series([1, 2, 3], dtype=pd.CategoricalDtype(categories=pd.Index([1, 2, 3])))
result.loc[1] = np.nan  # same for pd.NA
```

This keeps the categorical dtype, so enlargement should too. Could you add tests for enlarging and for setting into it, to check that they are consistent?
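A sketch of the in-place half of such a test (the name is mine; the enlargement half would mirror it with a new label, and its expected dtype is exactly what this PR is settling, so I'm only asserting the stable part here):

```python
import numpy as np
import pandas as pd

def test_setting_nan_keeps_categorical():
    # NaN is always a valid value for a Categorical, so setting it into
    # an existing position must preserve the categorical dtype.
    ser = pd.Series([1, 2, 3], dtype=pd.CategoricalDtype(pd.Index([1, 2, 3])))
    ser.loc[1] = np.nan
    assert isinstance(ser.dtype, pd.CategoricalDtype)
    assert pd.isna(ser.iloc[1])

test_setting_nan_keeps_categorical()
```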
Could you also add tests so that the integer dtype is kept, like above? E.g. Int64 stays Int64, Int32 stays Int32, and so on. (We have fixtures for that.)
Found one fixture in tests/groupby/test_function.py called `dtypes_for_minmax`, but that doesn't look suitable for reuse in other files. Also didn't check np.int32 and np.float32, since these are cast implicitly:

```python
>>> pd.CategoricalDtype(pd.array([1, 2, 3], dtype=np.int32)).__dict__
{'_categories': Int64Index([1, 2, 3], dtype='int64'), '_ordered': False}
```
It's called something like `any_numeric_dtype`.
I used `any_numeric_ea_dtype` for this; it seems like the most generic fixture available for this case.
I think there exists one that includes numpy dtypes, but this would work too.
Besides that (I still can't find a unified extension array + numpy dtype fixture), it seems like this wouldn't work for np.int32 and np.float32 anyway, since these are implicitly cast to their 64-bit counterparts:

```python
>>> pd.CategoricalDtype(np.array([1, 2, 3], dtype=np.int32)).__dict__
{'_categories': Int64Index([1, 2, 3], dtype='int64'), '_ordered': False}
```
Could you try `Index([1, 2, 3], dtype="int32")`? This should work.
Same issue:

```python
>>> pd.Index([1, 2, 3], dtype="int32")
Int64Index([1, 2, 3], dtype='int64')
```
Hm ok, then ea dtypes only
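Sticking to EA dtypes seems to sidestep the upcast: with an extension-array-backed `Index` (available in recent pandas), the 32-bit width is preserved. A quick check, under that assumption:

```python
import pandas as pd

# Nullable Int32 categories are not upcast to 64 bit, unlike the
# numpy int32 case shown above.
dt = pd.CategoricalDtype(pd.array([1, 2, 3], dtype="Int32"))
print(dt.categories.dtype)
```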
Any specific reasons for closing and not merging?
Ah no, sorry for that, pressed the wrong button.
merged #48106, thx @CloseChoice