
FIX: REGR: setting numeric value in Categorical Series with enlargement raises internal error

CloseChoice opened this pull request 3 years ago • 24 comments

  • [x] closes #47677
  • [x] Tests added and passed if fixing a bug or adding a new feature
  • [x] All code checks passed.
  • [x] Added type annotations to new arguments/methods/functions.
  • [x] Added an entry in the latest doc/source/whatsnew/v1.5.0.rst file if fixing a bug or adding a new feature.

CloseChoice avatar Jul 16 '22 16:07 CloseChoice

What happens when a string is used for enlarging? Like "a"?

The series is also cast to object, which might not be the behaviour we want. But this also happened prior to the regression (tested on v1.4.3).
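A minimal, self-contained repro of that fallback, using a value outside the existing categories (where object is the expected result even with the fix, since the categorical dtype cannot hold the new value):

```python
import pandas as pd

s = pd.Series(["a", "b", "c"], dtype="category")
s.loc[3] = "d"  # enlarge with a string that is not an existing category
# the categorical dtype cannot represent "d", so the series falls back to object
print(s.dtype)
```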

CloseChoice avatar Jul 16 '22 17:07 CloseChoice

Yeah this is also a bug then. Could you check if the categorical dtypes match and then return the dtype accordingly?

phofl avatar Jul 16 '22 17:07 phofl

Yeah this is also a bug then. Could you check if the categorical dtypes match and then return the dtype accordingly?

How should we tackle that problem? If the enlarging value can be cast to categorical, should we end up with a categorical series (so enlarging with 0, as in the issue, also yields a categorical series)? Or only if the enlarging value is already one of the categories?

CloseChoice avatar Jul 16 '22 17:07 CloseChoice

How does concat behave in that case?

phofl avatar Jul 16 '22 17:07 phofl

How does concat behave in that case?

Like this

import pandas as pd
s = pd.Series(["a", "b", "c"], dtype="category")
t = pd.concat([s, pd.Series(["a"], index=[3])])  # dtype: object
t2 = pd.concat([s, pd.Series(["a"], index=[3], dtype="category")])  # dtype: object

I would consider this fine for t but not for t2. Note that I checked this for v1.4.3 and with my fix, and the behaviour is the same for both.

CloseChoice avatar Jul 17 '22 08:07 CloseChoice

Yeah I think t2 is off

phofl avatar Jul 17 '22 10:07 phofl

Could you try t2 with categories specified as a, b, c?

phofl avatar Jul 17 '22 11:07 phofl

Could you try t2 with categories specified as a, b, c?

This is interesting:

s = pd.Series(["a", "b", "c"], dtype="category")  # as above
t = pd.concat([s, pd.Series(["a"])])  # dtype object
t2 = pd.concat([s, pd.Series(["a"], dtype="category")])  # dtype object
t3 = pd.concat([s, pd.Series(["a", "b"], dtype="category")])  # dtype object
t4 = pd.concat([s, pd.Series(["a", "b", "c"], dtype="category")])  # dtype category
t5 = pd.concat([s, pd.Series(["a", "b", "a", "b", "c"], dtype="category")])  # dtype category
t6 = pd.concat([s, pd.Series(["a", "b", "a", "b", "a"], dtype="category")])  # dtype object
t7 = pd.concat([s, pd.Series(["a", "b", "d"], dtype="category")])  # dtype object

Only if the categories match exactly do we preserve the categorical dtype. This looks really buggy. But it is not a regression: I checked v1.4.3, main, and my fix, and the behaviour is identical in all three. Should we create a separate issue for this?

CloseChoice avatar Jul 17 '22 13:07 CloseChoice

No, this makes sense: the dtypes are not equal when the categories differ. But enlargement with a scalar is a different case; we should preserve the categorical dtype there. We have to create a series with the correct categories in that case.
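The dtype-equality rule behind concat's behaviour can be seen directly (a minimal illustration, not pandas internals: two CategoricalDtype instances compare equal only when their categories match):

```python
import pandas as pd

# concat preserves a common dtype only when all inputs share it exactly;
# CategoricalDtypes with different categories are simply unequal dtypes
c_abc = pd.CategoricalDtype(["a", "b", "c"])
c_ab = pd.CategoricalDtype(["a", "b"])

print(c_abc == pd.CategoricalDtype(["a", "b", "c"]))  # True -> dtype preserved
print(c_abc == c_ab)                                  # False -> fall back to object
```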

phofl avatar Jul 17 '22 13:07 phofl

No, this makes sense: the dtypes are not equal when the categories differ. But enlargement with a scalar is a different case; we should preserve the categorical dtype there. We have to create a series with the correct categories in that case.

I updated the PR to preserve the categorical dtype if the enlarging element is already in the categories. I needed a special case in _setitem_with_indexer_missing for this, since we don't want to change concat_compat.
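The intended semantics can be sketched outside of pandas internals (a hypothetical helper named enlarge, not the actual _setitem_with_indexer_missing code):

```python
import pandas as pd

def enlarge(s: pd.Series, key, value) -> pd.Series:
    """Hypothetical sketch of the intended enlargement semantics:
    keep the categorical dtype when the new value is an existing
    category, otherwise fall back to the usual concat behaviour."""
    if isinstance(s.dtype, pd.CategoricalDtype) and value in s.dtype.categories:
        new = pd.Series([value], index=[key], dtype=s.dtype)
    else:
        new = pd.Series([value], index=[key])
    return pd.concat([s, new])

s = pd.Series(["a", "b", "c"], dtype="category")
print(enlarge(s, 3, "a").dtype)  # "a" is a category -> stays categorical
print(enlarge(s, 3, "d").dtype)  # "d" is not -> falls back to object
```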

Please note that in

import pandas as pd
s = pd.DataFrame([["a", "b"], ["c", "k"]], dtype="category")
s.loc[3] = "a"

both columns are cast to object. This behaviour was not changed. But to be consistent I would expect the first column to stay categorical while the second changes to object.
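The per-column expectation follows from each column carrying its own categories (a quick check; the variable name df is chosen for clarity):

```python
import pandas as pd

df = pd.DataFrame([["a", "b"], ["c", "k"]], dtype="category")
# each column has its own CategoricalDtype:
print(df[0].cat.categories.tolist())  # ['a', 'c']
print(df[1].cat.categories.tolist())  # ['b', 'k']
# "a" is a category of column 0 only, hence the expectation that
# column 0 stays categorical while column 1 is cast to object
```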

CloseChoice avatar Jul 17 '22 15:07 CloseChoice

DataFrame cases don't work yet, this is correct. But those have multiple remaining issues, so let's focus on Series here.

phofl avatar Jul 18 '22 00:07 phofl

Could you also add a test where we enlarge with nan? This is not part of the categories but should probably work? Not sure

phofl avatar Jul 18 '22 01:07 phofl

Could you also add a test where we enlarge with nan? This is not part of the categories but should probably work? Not sure

Yep, it works in the sense that after enlarging with nan the series has dtype object. But I think that's fine. I added the test.

CloseChoice avatar Jul 18 '22 06:07 CloseChoice

I was referring to the opposite.

import numpy as np
import pandas as pd

result = pd.Series([1, 2, 3], dtype=pd.CategoricalDtype(categories=pd.Index([1, 2, 3])))
result.loc[1] = np.nan  # likewise for pd.NA

this keeps the categorical dtype, so enlargement should too. Could you add tests for both enlarging and setting into the series, to check that they are consistent?

Could you also add tests so that the integer dtype is kept like above, e.g. Int64 stays Int64, Int32 stays Int32, and so on? (We have fixtures for that.)

phofl avatar Jul 18 '22 10:07 phofl

Could you also add tests so that the integer dtype is kept like above? e.g. Int64 stays Int64, Int32 stays Int32 and so on. (we have fixtures for that)

I found one fixture in tests/groupby/test_function.py called dtypes_for_minmax, but that doesn't look suitable for reuse in other files. Also, I didn't check np.int32 and np.float32, since

import numpy as np
import pandas as pd

pd.CategoricalDtype(pd.array([1, 2, 3], dtype=np.int32)).__dict__
>>> {'_categories': Int64Index([1, 2, 3], dtype='int64'), '_ordered': False}

these are cast implicitly.
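By contrast, the masked extension-array dtypes do keep their width when used as categories, which is why an EA-dtype fixture is usable here (a quick check on a recent pandas; behaviour on older versions may differ):

```python
import pandas as pd

# EA integer dtypes are not widened when turned into categories
cat32 = pd.CategoricalDtype(pd.array([1, 2, 3], dtype="Int32"))
cat64 = pd.CategoricalDtype(pd.array([1, 2, 3], dtype="Int64"))
print(cat32.categories.dtype)  # Int32, not Int64
print(cat64.categories.dtype)  # Int64
```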

CloseChoice avatar Jul 18 '22 16:07 CloseChoice

It's called something like any_numeric_dtype

phofl avatar Jul 18 '22 16:07 phofl

It's called something like any_numeric_dtype

I used any_numeric_ea_dtype for this; it seems like the most generic fixture available for this case.

CloseChoice avatar Jul 19 '22 13:07 CloseChoice

I think there exists one that includes numpy dtypes, but this would work too

phofl avatar Jul 19 '22 13:07 phofl

I think there exists one that includes numpy dtypes, but this would work too

Besides the fact that I still can't find a unified extension-array + numpy dtype fixture, this wouldn't work for np.int32 and np.float32 anyway, since these are implicitly cast to their 64-bit counterparts:

import numpy as np
import pandas as pd

pd.CategoricalDtype(np.array([1, 2, 3], dtype=np.int32)).__dict__
>>> {'_categories': Int64Index([1, 2, 3], dtype='int64'), '_ordered': False}

CloseChoice avatar Jul 19 '22 13:07 CloseChoice

Could you try Index([1, 2, 3], dtype='int32')? This should work.

phofl avatar Jul 19 '22 13:07 phofl

Index([1, 2, 3], dtype='int32')

Same issue:

import pandas as pd

pd.Index([1, 2, 3], dtype='int32')
>>> Int64Index([1, 2, 3], dtype='int64')

CloseChoice avatar Jul 19 '22 13:07 CloseChoice

Hm ok, then EA dtypes only

phofl avatar Jul 19 '22 13:07 phofl

Any specific reasons for closing and not merging?

CloseChoice avatar Jul 19 '22 14:07 CloseChoice

Ah no, sorry for that. Pressed the wrong button

phofl avatar Jul 19 '22 14:07 phofl

merged #48106, thx @CloseChoice

phofl avatar Aug 16 '22 21:08 phofl