
make unique and <Series>.value_counts() work with boolean dtype

Open kellyjoy15 opened this issue 2 years ago • 4 comments

IMO, this should not require a call into the server value_counts message. It could be achieved with a reduction that counts the True values, followed by a subtraction from the length to get the count for False.
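A minimal sketch of that idea, using a plain Python list as a stand-in for the array so it runs without a server. The helper name `bool_value_counts` is hypothetical and not part of arkouda; on a pdarray, the two ingredients would presumably be a sum reduction and the array's size.

```python
def bool_value_counts(values):
    """Count True/False entries using one reduction and one subtraction.

    `values` is any sized sequence of booleans (a stand-in for a bool array).
    """
    # One reduction: count the True entries.
    n_true = sum(1 for v in values if v)
    # No second pass needed: everything that is not True is False.
    n_false = len(values) - n_true
    return {False: n_false, True: n_true}


counts = bool_value_counts([True, False, True, True])
print(counts)  # {False: 1, True: 3}
```

The point of the design is that a boolean array has only two possible values, so a full unique/group-by pass is not needed to count them.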

kellyjoy15 avatar May 02 '22 21:05 kellyjoy15

@kellyjoy15, I was not having any issues using boolean values in a Series when running .value_counts(). GroupBy, which handles the computations here, was updated in the latest release to handle the boolean dtype. Is it possible you're not using the latest release, or would you still like this functionality to be modified?

Here's my test case and output:

In [35]: s
Out[35]:
0     False
1      True
2      True
3      True
4     False
5      True
6      True
7      True
8     False
9     False
10     True
11     True
12    False
13     True
14    False
15    False
16     True
17     True
18    False
19    False
20    False
21    False
22    False
23    False
24     True
dtype: bool
In [40]: s.value_counts()
Out[40]:
False    13
True     12
dtype: int64

joshmarshall1 avatar May 05 '22 13:05 joshmarshall1

So, I am running v2022.04.15

GroupBy and groupby work with bools in that version, so I don't know if it is still a problem.

Here is my reproducer, along with other things I tried and the result:

a = ak.array([True, False, True,True])
s = ak.Series(a)
ak.value_counts(s)
# TypeError: type of argument "pda" must be arkouda.pdarrayclass.pdarray; got arkouda.series.Series instead

s.value_counts()
# TypeError: type of argument "pda" must be arkouda.pdarrayclass.pdarray; got numpy.bool_ instead

ak.value_counts(a)
# RuntimeError: Error: unique: bool not implemented

a.value_counts()
# AttributeError: 'pdarray' object has no attribute 'value_counts'

s.unique()
# AttributeError: 'Series' object has no attribute 'unique'

ak.unique(s)
# TypeError: must be pdarray, Strings, or Categorical {}

a.unique()
# AttributeError: 'pdarray' object has no attribute 'unique'

ak.unique(a)
# RuntimeError: Error: unique: bool not implemented


kellyjoy15 avatar May 05 '22 14:05 kellyjoy15

@kellyjoy15, this got a bit long so TL;DR

  • workaround for the meantime
    >>> a = ak.array([True, False, True,True])
    >>> g = ak.GroupBy(a)
    >>> g.count()
    (array([False True]), array([1 3]))
    
  • Series expects a tuple of pdarrays, i.e. s = ak.Series((ak.arange(a.size), a))
  • unique doesn't support bool arrays in 'v2022.04.15'; this is fixed in 'v2022.05.05'

I think there are a few things at play here:

  • The input is not what Series expects: https://github.com/Bears-R-Us/arkouda/blob/67a5b542e28647bcc5d4b283b8f3e4bcfc006191/arkouda/series.py#L74-L77 When ak.Series(a) is called, it expects index to be the first element and values to be the second element, so index is set to True and values is set to False.

    >>> a = ak.array([True, False, True,True])
    >>> s = ak.Series(a)
    >>> s.index
    True
    >>> s.values
    False
    >>> s
    AttributeError: 'numpy.bool_' object has no attribute 'to_ndarray'
    

    This explains the # TypeError: type of argument "pda" must be arkouda.pdarrayclass.pdarray; got numpy.bool_ instead. This is similar to the issue seen in #1267, and we need more robust type checking to raise an error when the Series is initialized.

  • unique doesn't support bool in 'v2022.04.15'. This was fixed and is in today's release 'v2022.05.05', and will hopefully make its way to you soon.

    >>> ak.get_config()['arkoudaVersion']
    'v2022.04.15'
    
    >>> a = ak.array([True, False, True,True])
    >>> s = ak.Series((ak.arange(a.size), a))
    >>> s.index
    array([0 1 2 3])
    >>> s.values
    array([True False True True])
    
    >>> s.value_counts()
    RuntimeError: Error: unique: bool not implemented
    >>> ak.value_counts(a)
    RuntimeError: Error: unique: bool not implemented
    >>> ak.unique(a)
    RuntimeError: Error: unique: bool not implemented
    
  • The rest of the errors appear to be attribute methods that aren't implemented or functions that don't accept Series as a parameter type. I'll leave #1352 to decide if we should modify groupable types to accept Series.
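For the first point above, stricter checking at construction time could look roughly like the sketch below. It uses plain Python sequences in place of pdarrays, and `check_series_data` is a hypothetical helper, not an existing arkouda function. The idea is only to fail early with a clear message instead of silently unpacking two booleans into index and values.

```python
def check_series_data(data):
    """Hypothetical guard: require an (index, values) pair of equal-length sequences."""
    # The container itself must be a 2-element tuple or list.
    if not isinstance(data, (tuple, list)) or len(data) != 2:
        raise TypeError("expected a pair (index, values)")
    index, values = data
    # Each element must itself be array-like; a bare bool such as True is not.
    for name, part in (("index", index), ("values", values)):
        if not hasattr(part, "__len__"):
            raise TypeError(f"{name} must be array-like, got {type(part).__name__}")
    if len(index) != len(values):
        raise ValueError("index and values must have the same length")
    return index, values


# A well-formed pair passes through unchanged.
index, values = check_series_data(([0, 1, 2, 3], [True, False, True, True]))

# A flat sequence of booleans is rejected up front.
try:
    check_series_data([True, False, True, True])
except TypeError as err:
    print(f"rejected: {err}")
```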

In the latest release it looks like this (forgive me for the reordering; I tried to group them in a way that was logical to me):

>>> ak.get_config()['arkoudaVersion']
'v2022.05.05'

>>> a = ak.array([True, False, True,True])
>>> s = ak.Series((ak.arange(a.size), a))
>>> s
0     True
1    False
2     True
3     True
dtype: bool

# only way that works in 'v2022.04.15'
>>> ak.GroupBy(a).count()
(array([False True]), array([1 3]))

# works only in 'v2022.05.05'
>>> s.value_counts()
True     3
False    1
dtype: int64

>>> ak.value_counts(a)
(array([False True]), array([1 3]))

>>> ak.unique(a)
array([False True])

# methods not implemented
>>> s.unique()
AttributeError: 'Series' object has no attribute 'unique'

>>> a.unique()
AttributeError: 'pdarray' object has no attribute 'unique'

>>> a.value_counts()
AttributeError: 'pdarray' object has no attribute 'value_counts'

# Series not currently an accepted type
>>> ak.value_counts(s)
TypeError: type of argument "pda" must be arkouda.pdarrayclass.pdarray; got arkouda.series.Series instead

>>> ak.unique(s)
TypeError: <class 'arkouda.series.Series'> does not support grouping

stress-tess avatar May 05 '22 21:05 stress-tess

It's worth noting that what you expected to happen is totally reasonable, as that's how it's handled in pandas. We should probably update Series to automatically make the index if only values are passed in.

>>> pd.Series([True,False,True,True])
0     True
1    False
2     True
3     True
dtype: bool

This discrepancy is captured in #1363
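As a rough illustration of the pandas-like behaviour, here is a sketch with plain Python sequences standing in for pdarrays. `normalize_series_input` is a hypothetical helper, and how arkouda actually resolves #1363 may differ.

```python
def normalize_series_input(data):
    """Hypothetical helper: return (index, values), building a 0..n-1 index if needed."""
    # An explicit (index, values) tuple is taken as-is.
    if isinstance(data, tuple) and len(data) == 2:
        index, values = data
    else:
        # Only values were given: generate a default positional index,
        # analogous to ak.arange(a.size) in the examples above.
        values = data
        index = list(range(len(values)))
    return index, values


index, values = normalize_series_input([True, False, True, True])
print(index)   # [0, 1, 2, 3]
print(values)  # [True, False, True, True]
```

One caveat with this approach: a plain 2-element tuple of values is indistinguishable from an (index, values) pair, so a real implementation would likely dispatch on the array type rather than on the container.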

stress-tess avatar May 05 '22 21:05 stress-tess

@pierce314159 checked and this is now functional.

Ethan-DeBandi99 avatar Nov 29 '22 18:11 Ethan-DeBandi99