vaex icon indicating copy to clipboard operation
vaex copied to clipboard

Proposed correction for category_values()

Open AlenkaF opened this issue 2 years ago • 6 comments

Thi PR is a proposed correction for category_values() function on the dataframe class. When using df.category_values(column) an error accurs while accessing a list item that doesn't exist (list _category, item 'values')

df = vaex.from_arrays(year=[2012, 2015, 2019], weekday=[0, 4, 6])
df = df.categorize('year', min_value=2012, max_value=2019)
df = df.categorize('weekday', labels=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])

from vaex.utils import _ensure_string_from_expression
column = _ensure_string_from_expression(df.year)

df._categories[column]

gives the list as an output

{'labels': [2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019],
 'N': 8,
 'min_value': 2012}

and calling category_values() function

df.category_values(column)

result in an error

    348     def category_values(self, column):
    349         column = _ensure_string_from_expression(column)
--> 350         return self._categories[column]['values']
    351 
    352     def category_count(self, column):

KeyError: 'values'

I suggest using self.columns[column] instead of self._categories[column]['values'] in the category_values function.

Versions used: Python 3.9.5, vaex-core: 4.3.0.post1, vaex-viz: 0.5.0, vaex-hdf5: 0.8.0, vaex-server: 0.5.0, vaex-astro: 0.8.2, vaex-jupyter: 0.6.0, vaex-ml: 0.12.0

AlenkaF avatar Jul 27 '21 12:07 AlenkaF

Hi @AlenkaF !

Thanks for catching this! I had a quick look and there seem like there are a few things wrong here (around the categorize method).

I have to say - i am not sure what the category_values method is meant to return.. could even be deprecated? I am quite sure users should not be exposed to df.columns. I'll look where and how this is used internally. @maartenbreddels will know better.

I also noticed that df.category_labels does not return a list even if aslist=True. We should fix that also.

_category_dictionary should return a meaningful error when the column passed is not of arrow type.

You are welcome to have a look at any of these points!

JovanVeljanoski avatar Jul 27 '21 12:07 JovanVeljanoski

Hi @JovanVeljanoski!

Thank you for your quick reply.

Hi @AlenkaF !

Thanks for catching this! I had a quick look and there seem like there are a few things wrong here (around the categorize method).

I have to say - i am not sure what the category_values method is meant to return.. could even be deprecated? I am quite sure users should not be exposed to df.columns. I'll look where and how this is used internally. @maartenbreddels will know better.

In case of categorical column it makes sense IMO to have category_labels and category_values functions as they can be used similar to how Pandas uses .codes and categories.values. If df.columns isn't a good option maybe simple self[column].values would be better?

I also noticed that df.category_labels returns does not return a list even if aslist=True. We should fix that also.

_category_dictionary should return a meaningful error when the column passed is not of arrow type.

You are welcome to have a look at any of these points!

I would be happy to try and fix df.category_labels and look into _category_dictionary. Should I make a seperate PR for them?

One more question: is there a way to test if expression is categorical or not? Or can this only be done on the dataframe level?

AlenkaF avatar Jul 27 '21 13:07 AlenkaF

Ah yeah, I forgot to mention..

all the bugs you have found should be exposed as unit-tests before they are fixed :) (so we can develop forward without fear).
Feel free to put tests in tests/category_test.py

There is a method to df.is_category() to check if a column is categorical or not.

In case of categorical column it makes sense IMO to have category_labels and category_values functions as they can be used similar to how Pandas uses .codes and categories.values. If df.columns isn't a good option maybe simple self[column].values would be better?

Regarding this, I'd like the opinion of @maartenbreddels since I am not 100% sure how or if these methods are used internally. Of course you can check this yourself (i'm worried about breaking changes).

Yeah if you keep to this topic around categorization, i think a single PR is fine, all these things are related.

JovanVeljanoski avatar Jul 27 '21 13:07 JovanVeljanoski

Ah yeah, I forgot to mention..

all the bugs you have found should be exposed as unit-tests before they are fixed :) (so we can develop forward without fear). Feel free to put tests in tests/category_test.py

Oh, sorry about that. Donno how I missed it. Will do.

There is a method to df.is_category() to check if a column is categorical or not.

If I understand correctly df.is_category() needs to be called on the dataframe level. If I only have a column is there a way to check the type without referring back to the whole df?

In case of categorical column it makes sense IMO to have category_labels and category_values functions as they can be used similar to how Pandas uses .codes and categories.values. If df.columns isn't a good option maybe simple self[column].values would be better?

Regarding this, I'd like the opinion of @maartenbreddels since I am not 100% sure how or if these methods are used internally. Of course you can check this yourself (i'm worried about breaking changes).

Yeah if you keep to this topic around categorization, i think a single PR is fine, all these things are related.

I will check for internal use of the method and see if I find anything.

Thank you for your help!

AlenkaF avatar Jul 28 '21 05:07 AlenkaF

In vaex (unlike in say pandas and other dataframe libraries), an expression (or a column) is always coupled to a particular dataframe. You can't have "free floating" expressions, that independent of any dataframe.

For example this core works:

import vaex
df = vaex.example()
x = df.x
print(x.df.is_category('x'))

If for example you'd like to have a function that only takes an expression(s) as an argument and you don't want to pass the dataframe around explicitly.

Feel free to ask any questions! (here or in slack). Thanks!

JovanVeljanoski avatar Jul 28 '21 06:07 JovanVeljanoski

Deleted the branch by accident and the PR got closed. Reopening it now.

AlenkaF avatar Aug 11 '21 09:08 AlenkaF