[ENH] transform_column: transform multiple columns

Open mdini opened this issue 5 years ago • 29 comments

mdini avatar May 07 '19 18:05 mdini

@mdini if you'd like to take on this issue, please let us know here!

ericmjl avatar May 08 '19 02:05 ericmjl

@ericmjl yes, I'd like to.

mdini avatar May 08 '19 13:05 mdini

Awesome stuff! Just marked this as being worked on.

Thank you for all of your contributions :smile:.

ericmjl avatar May 08 '19 13:05 ericmjl

And thank you for the great sprint and this great package :-)

mdini avatar May 08 '19 14:05 mdini

Hi @mdini Does this have a different functionality than transform_columns ?

jk3587 avatar May 08 '19 14:05 jk3587

@jk3587 Sorry, I didn't see that there is already a transform_columns function. But why do we have transform_column then? In my opinion transform_columns is enough and the user can call transform_columns with a list of one element.

mdini avatar May 08 '19 15:05 mdini

But why do we have transform_column then?

@mdini this was a design choice I made early on, which in retrospect may not have been the best. I had created the singular and plural versions of a function where relevant. Maybe it's better to only provide the plural version? @jk3587 what are your thoughts here?

I know that @zbarry and @szuckerman raised the same questions and didn't mind it at first, but looking at the API and how much it might grow, I think it's best to revisit this question once again.

In my opinion transform_columns is enough and the user can call transform_columns with a list of one element.

This is definitely a good API suggestion. I'm thinking that for transform_columns, a few examples of a nice API might be as follows:

# transform only one column, while creating a new column name for it.
df.transform_columns(column_name=['col1'], function=np.abs, new_column_name=['col1_abs'])

# transform multiple columns by the same function, without creating a new column name.
df.transform_columns(column_name=['col1', 'col2'], function=np.abs)

# transform multiple columns by the same function, while creating new column names.
df.transform_columns(column_name=['col1', 'col2'], function=np.abs, new_column_name=['col1_abs', 'col2_abs'])

# transform multiple columns, each with their own function, and new column names.
def negative(x):
    return -x

# note here that column_name is not provided, because it is present in the mapping.
df.transform_columns(
    mapping={
        # structure of dictionary:
        # <original column name>: (<function>, <new column name or None>)
        'col1': (np.abs, "col1_abs"),
        'col2': (negative, None),  # do not change this column name
    }
)

This API should cover most of the use cases; in particular, use of the mapping kwarg is the most general, but the use of the other kwargs would make it easier to read for single- or few-column transformations. What do you all think?
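To make the mapping form concrete, here is a minimal sketch of how it could behave. The helper name transform_columns_sketch is hypothetical and this is not pyjanitor's implementation; it only assumes the `<original column name>: (<function>, <new column name or None>)` structure described above.

```python
import numpy as np
import pandas as pd


def transform_columns_sketch(df, mapping):
    """Hypothetical helper: apply {column: (function, new_name_or_None)}."""
    out = df.copy()
    for column, (function, new_name) in mapping.items():
        # Write to the new name if one is given, otherwise overwrite in place.
        target = column if new_name is None else new_name
        out[target] = function(df[column])
    return out


def negative(x):
    return -x


df = pd.DataFrame({"col1": [-1, 2, -3], "col2": [4, -5, 6]})
result = transform_columns_sketch(
    df,
    mapping={
        "col1": (np.abs, "col1_abs"),  # new column, col1 is kept as-is
        "col2": (negative, None),      # overwritten under the same name
    },
)
```

In this sketch the input DataFrame is copied first, so the caller's df is left untouched.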

ericmjl avatar May 08 '19 15:05 ericmjl

While working on a TidyTuesday notebook example, I came across R's mutate function which seems to be essentially the transform_columns that is currently implemented but being allowed to transform multiple columns with different functions.

# Removes brackets from some columns
clean_df <- raw_df %>% 
  # Removes brackets from a string columns 'producers' and 'genre'
  mutate(producers = str_remove(producers, "\\["),
         producers = str_remove(producers, "\\]"),
         genre = str_remove(genre, "\\["),
         genre = str_remove(genre, "\\]"))

The pandas documentation has a page comparing R and pandas functions, and it shows that mutate is most similar to df.assign.

Looking at the documentation for df.assign it seems that, for the most part, it does what transform_columns does. The part that I'm confused about is that the return value is a New DataFrame with the modified or new columns. @ericmjl Does the current implementation of transform_column return a new DataFrame with the modified columns or does it return the original DataFrame with modified columns?

Lastly, for small datasets, a wrapper around df.assign with a col: df.str function mapping could help with #273. Here is a link of such implementation. Not sure how this would affect performance for larger DataFrames since it's returning a new DataFrame.
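As a rough pandas counterpart to the mutate snippet above, the sketch below uses df.assign with callables. The sample data is made up for illustration, and str.strip("[]") is only an approximation of the paired str_remove calls (it strips brackets from both ends of each string). It also shows that assign returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

# Made-up sample data standing in for raw_df.
raw_df = pd.DataFrame(
    {"producers": ["[A, B]", "[C]"], "genre": ["[Drama]", "[Comedy]"]}
)

# Each keyword names the column to (over)write; the callable receives the frame.
clean_df = raw_df.assign(
    producers=lambda d: d["producers"].str.strip("[]"),
    genre=lambda d: d["genre"].str.strip("[]"),
)
```

Here clean_df is a separate object, and raw_df still holds the bracketed strings.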

jk3587 avatar May 08 '19 16:05 jk3587

I like the idea of only having the plural version of the function. To still give it the same functionality as a "singular" function, there are a few options:

  1. Allow column and columns arguments; column would take a string instead of a list and would work like transform_column.
  2. Only allow a columns argument, but if a user inputs a string instead of a list you check it with something like the following:
if isinstance(columns, str):
    columns = [columns]

szuckerman avatar May 08 '19 16:05 szuckerman

So, is it agreed then that there should only be the plural of each function? Keen to get the final stance on this and see if I can take it on

samukweku avatar Jun 27 '20 22:06 samukweku

@samukweku it's an important API decision. What's your thought?

ericmjl avatar Jun 27 '20 22:06 ericmjl

I agree with the plural form of the functions. Fewer functions to keep track of, and the plural form still remains intuitive to the end user. Plural wins!

I will go with @szuckerman option 2 - just a single columns option and we can pass a string or a list of columns. In the backend we can split it into a list of columns, if the string is a combination of columns

samukweku avatar Jun 27 '20 22:06 samukweku

Sounds good. This is a breaking change for the API, btw, so be sure to follow "deprecation practices". If I remember correctly, @hectormz put in a decorator for deprecating stuff. You might want to see if (a) it can be reused, or (b) a similar thing can be built for this situation (i.e. "deprecation of singular"). I'll probably need a warning to remind myself for a few versions, as I've gotten used to typing df.transform_column("column_name", function_name, "new_column_name").
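For reference, a "deprecation of singular" warning could look roughly like the sketch below. The decorator name deprecated_in_favor_of is hypothetical and is not the existing pyjanitor decorator mentioned above; the body of transform_column here is a stand-in for illustration only.

```python
import functools
import warnings

import pandas as pd


def deprecated_in_favor_of(new_name):
    """Hypothetical decorator: warn that a function is superseded by `new_name`."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated; use {new_name} instead.",
                DeprecationWarning,
                stacklevel=2,  # point the warning at the caller's line
            )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@deprecated_in_favor_of("transform_columns")
def transform_column(df, column_name, function, new_column_name=None):
    # Stand-in body for illustration only; not pyjanitor's implementation.
    out = df.copy()
    out[new_column_name or column_name] = function(df[column_name])
    return out


df = pd.DataFrame({"a": [-1, 2]})
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = transform_column(df, "a", abs, "a_abs")
```

The old call signature keeps working, and each call records a DeprecationWarning in caught.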

Actually, I think the API used in the function find_replace could be a good source of inspiration:

df.transform_columns(col1={"col1_log": np.log10, "col1_neg": lambda x: -x})

Or more generally:

df.transform_columns(column_name={"new_column_name": function})

What are your thoughts? Perhaps keeping transform_column around for legacy reasons (until we go 1.x) would be good, and encouraging more of the aforementioned pattern (column_name={new_name: func}) would be a nice direction for the API to evolve!
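A minimal sketch of the column_name={new_name: func} pattern, assuming a hypothetical free function transform_columns_kwargs rather than a registered DataFrame method:

```python
import numpy as np
import pandas as pd


def transform_columns_kwargs(df, **column_specs):
    """Hypothetical: each keyword is column_name={new_column_name: function}."""
    out = df.copy()
    for column_name, new_columns in column_specs.items():
        for new_column_name, function in new_columns.items():
            # Every entry derives one output column from the source column.
            out[new_column_name] = function(df[column_name])
    return out


df = pd.DataFrame({"col1": [1.0, 10.0, 100.0]})
result = transform_columns_kwargs(
    df, col1={"col1_log": np.log10, "col1_neg": lambda x: -x}
)
```

One limitation of this style is that source column names must be valid Python identifiers to be passed as keywords; other names would need to go through a dictionary unpacked with `**`.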

ericmjl avatar Jun 27 '20 23:06 ericmjl

Bringing this up here as well. I think Pandas' transform function is sufficient and does pretty much all that transform column(s) does. I suggest we slate the function for deprecation - unless someone has some use cases that Pandas transform cannot pull off.

samukweku avatar Aug 18 '20 12:08 samukweku

If I am reading the pandas docs correctly, I think transform only operates on the entire dataframe. Does it allow selective column transformations?

ericmjl avatar Aug 18 '20 16:08 ericmjl

Yes, it does allow selective column transformations, although after playing with it more, I see that you have to reassign the values to the DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ["Juan", 0, 0, 400, 450, 500],
        ["Luis", 100, 100, 100, 100, 100],
        ["Maria", 0, 20, 50, 300, 500],
        ["Laura", 0, 0, 0, 100, 900],
        ["Lina", 0, 0, 0, 0, 10],
    ],
    columns=["Name", "Date1", "Date2", "Date3", "Date4", "Date5"],
)

df

df.transform({'Date1':np.abs, 'Date3':lambda x: x+1, 'Date4' : np.sqrt})
    Date1 | Date3 | Date4
        0 | 401   | 21.213203
      100 | 101   | 10.000000
        0 | 51    | 17.320508
        0 | 1     | 10.000000
        0 | 1     | 0.000000
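One way to do that reassignment is sketched below, using a smaller made-up frame. df.transform with a dictionary returns only the keyed columns, and df.assign then writes them back over the originals while keeping everything else.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Juan", "Luis"],
        "Date1": [0, -100],
        "Date3": [400, 100],
        "Date4": [450, 100],
    }
)

functions = {"Date1": np.abs, "Date3": lambda x: x + 1}

# Only the columns named in the dictionary come back.
transformed = df.transform(functions)

# Write the transformed columns back; untouched columns are kept.
updated = df.assign(**{column: transformed[column] for column in transformed.columns})
```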

samukweku avatar Aug 18 '20 21:08 samukweku

Cool stuff! Thanks @samukweku for figuring this out 😄 🎉!

With transform_column, looks like we're covering the case where we want to selectively transform a bunch of columns while retaining anything that isn't explicitly passed in. This is a value-add on top of what df.transform does. We are also able to transform the column into a new column this way too. The semantics of transform are such that the resulting dataframe columns are taken from the keys passed in, which I think might be a bit too restrictive for the broader use case that transform_columns solves.

I'd like to propose something along these lines then:


from typing import Dict

identity = lambda x: x

# Just a sketch, implementation needs to be fleshed out
def transform_columns(df, function_mapping: Dict, new_column_names: Dict):
    # First off, make a function mapping with "identity" as the default
    functions = {c: identity for c in df.columns}
    functions.update(function_mapping)
    
    # Next, transform the dataframe.
    new_df = df.transform(functions)

    # Next, rename the columns based on what's in new_column_names
    new_df = new_df.rename(new_column_names, axis="columns")

    # Finally, put back any columns from the original that are not intended to be overwritten.
    new_df = ... # something... mind is blank on this
    return new_df

This is kind of a nice programming puzzle to figure out, I think. It might also help eliminate future "semantics debt", where the semantics of a function get muddied over time. The implementation is also flat ("flat is better than nested"), delegates to pandas what ought to be delegated to pandas, and should be more maintainable in the long-run.
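One possible way to fill in the missing step is sketched here, as an assumption rather than the agreed design. Instead of padding the mapping with identity functions, this variant transforms only the requested columns and lets df.assign merge them back, so original columns are never dropped. The name transform_columns_sketch is hypothetical.

```python
from typing import Callable, Dict, Optional

import numpy as np
import pandas as pd


def transform_columns_sketch(
    df: pd.DataFrame,
    function_mapping: Dict[str, Callable],
    new_column_names: Optional[Dict[str, str]] = None,
) -> pd.DataFrame:
    new_column_names = new_column_names or {}
    # Transform only the requested columns; the result has just those keys.
    transformed = df.transform(function_mapping)
    # Rename the transformed columns that should land under a new name.
    transformed = transformed.rename(columns=new_column_names)
    # assign() overwrites same-named columns in place and appends new ones,
    # so untouched columns and renamed-from originals are both kept.
    return df.assign(**{name: transformed[name] for name in transformed.columns})


df = pd.DataFrame({"Name": ["Juan", "Luis"], "Date1": [-5, 100], "Date3": [400, 100]})
result = transform_columns_sketch(
    df,
    function_mapping={"Date1": np.abs, "Date3": np.sqrt},
    new_column_names={"Date3": "Date3_sqrt"},
)
```

With this approach the original column order is kept, and any newly named columns are appended at the end.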

The only downside is that this is an API change, so evolving the API would mean we should preserve the API now, add the new API as keyword arguments, add an API deprecation warning, mark the old chunk of code to be deprecated using comments, and only in the 1.x release deprecate the old code.

What do you all think?

ericmjl avatar Aug 19 '20 16:08 ericmjl

@ericmjl do you mind sharing an example, so I can understand better your proposition? Confused about identity and the function mapping

samukweku avatar Aug 20 '20 00:08 samukweku

Yep, definitely @samukweku!

I posted something here. Please let me know what you think of it!

cc: @sallyhong

ericmjl avatar Aug 20 '20 00:08 ericmjl

Thanks for sharing the example! Helps a lot! (Quick note: Cell6, the column rename didn't work. I think you have to put 'date4'--no caps)

Just wanted to confirm, for the second method style, how would you "transform" multiple columns when you want to rename some columns but not the other? (That's the main reason I like the first method style.)

I also think we should make sure that we preserve the column order of the original dataframe in the output. (Cell 8 output shows otherwise)

sallyhong avatar Aug 20 '20 01:08 sallyhong

As a side note, what if we used a new name mutate/transmute (nothing original in those names), that way we avoid the single/plural verb names, while still using a synonym that captures what we mean?

samukweku avatar Aug 20 '20 07:08 samukweku

Just wanted to confirm, for the second method style, how would you "transform" multiple columns when you want to rename some columns but not the other? (That's the main reason I like the first method style.) I also think we should make sure that we preserve the column order of the original dataframe in the output. (Cell 8 output shows otherwise)

@sallyhong I tried an alternative implementation, leveraging another function we have in the library. Check it out on this gist.

As a side note, what if we used a new name mutate/transmute (nothing original in those names), that way we avoid the single/plural verb names, while still using a synonym that captures what we mean?

@samukweku that's a cool idea. I think we might be hitting on the semantics of R's mutate.

What do both of you think about the 2nd implementation (transform_column_v3) in the gist? @sallyhong, would you like to give it a shot, sort of as a way of getting back into contributing? 😄

ericmjl avatar Aug 21 '20 00:08 ericmjl

Hi, just assigned this to myself. I'll take a shot at this, this weekend! I want to study the R equivalent too to glean any other ideas.

Thanks!

sallyhong avatar Aug 21 '20 01:08 sallyhong

Hi!

I attempted a v4.

Link to Gist

I put some scenarios at the bottom. Please take a look and I'd love to hear feedback from anyone (about anything!) Thanks :)

sallyhong avatar Sep 14 '20 00:09 sallyhong

@sallyhong I like the ideas. Just thinking about the append_new argument; it makes the transform_columns function more like an assign function. My thought is that the transform function should just transform (mutate) the columns, and not create new columns. It also seems like a good option. Just my opinion, others might have a different opinion.

samukweku avatar Sep 14 '20 11:09 samukweku

Very good point @samukweku . I was wrestling with this as R has two functions (mutate vs mutate_at) depending on whether or not you want to create a new column or overwrite the old column.

I decided that I didn't like the idea of changing a column's contents without knowing that I did (e.g. providing a transform_dict with append_new=False), and wanted it as a default to clearly indicate which columns were changed.

Or maybe we can just raise a warning if a user changes a column without a name and suggest they use df.assign for those columns instead? Not sure, but I'm not too attached to my code so I'm happy to change/update it :)

sallyhong avatar Sep 14 '20 13:09 sallyhong

@sallyhong I like what you've done there! I am also wondering, what if the default suffix was not a fixed label, but instead the __name__ attribute of the function passed in? This would be magical :tada: for users. For functions that are anonymous (they don't have a __name__ attribute), such as lambdas, the _new can be the baseline default.
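A small sketch of that suffix rule follows; the helper name default_suffix and the "new" fallback are assumptions for illustration. Note that in CPython a lambda does carry a `__name__`, but it is the placeholder string `<lambda>`, so the sketch treats that value as anonymous. Some other callables may lack the attribute entirely, which getattr with a default handles.

```python
def default_suffix(function, fallback="new"):
    """Hypothetical helper: pick a column suffix from the function's name."""
    name = getattr(function, "__name__", None)
    # Lambdas report "<lambda>", which is not a useful column suffix.
    if name is None or name == "<lambda>":
        return fallback
    return name


def add_two(x):
    return x + 2


named_suffix = default_suffix(add_two)          # uses the function's own name
lambda_suffix = default_suffix(lambda x: x + 2)  # falls back to the default
new_column_name = f"col1_{named_suffix}"
```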

ericmjl avatar Sep 18 '20 23:09 ericmjl

Thanks @ericmjl. I never knew of the __name__ attribute. Let me do some research and learn about this!

sallyhong avatar Sep 19 '20 20:09 sallyhong

@ericmjl , great idea! I tried it out with a test function too (named "add_two")

https://gist.github.com/sallyhong/627db0810ee917df0f978b8557245a61

sallyhong avatar Oct 01 '20 02:10 sallyhong