pyjanitor
[ENH] transform_column: transform multiple columns
@mdini if you'd like to take on this issue, please let us know here!
@ericmjl yes, I'd like to.
Awesome stuff! Just marked this as being worked on.
Thank you for all of your contributions :smile:.
And thank you for the great sprint and this great package :-)
Hi @mdini, does this have different functionality from transform_columns?
@jk3587 Sorry, I didn't see that there is a function transform_columns. But why do we have transform_column then? In my opinion, transform_columns is enough, and the user can call transform_columns with a list of one element.
But why do we have transform_column then?
@mdini this was a design choice I made early on, which in retrospect may not have been the best. I had created the singular and plural versions of a function where relevant. Maybe it's better to only provide the plural version? @jk3587 what are your thoughts here?
I know that @zbarry and @szuckerman raised the same questions and didn't mind it at first, but looking at the API and how much it might grow, I think it's best to revisit this question once again.
In my opinion transform_columns is enough and the user can call transform_columns with a list of one element.
This is definitely a good API suggestion. I'm thinking that for transform_columns, a few examples of a nice API might be as follows:
# transform only one column, while creating a new column name for it.
df.transform_columns(column_name=['col1'], function=np.abs, new_column_name=['col1_abs'])
# transform multiple columns by the same function, without creating a new column name.
df.transform_columns(column_name=['col1', 'col2'], function=np.abs)
# transform multiple columns by the same function, while creating new column names.
df.transform_columns(column_name=['col1', 'col2'], function=np.abs, new_column_name=['col1_abs', 'col2_abs'])
# transform multiple columns, each with their own function, and new column names.
def negative(x):
    return -x
# note here that column_name is not provided, because it is present in the mapping.
df.transform_columns(
    mapping={
        # structure of dictionary:
        # <original column name>: (<function>, <new column name or None>)
        'col1': (np.abs, "col1_abs"),
        'col2': (negative, None),  # do not change this column name
    }
)
This API should cover most of the use cases; in particular, use of the mapping kwarg is the most general, but the use of the other kwargs would make it easier to read for single- or few-column transformations. What do you all think?
While working on a TidyTuesday notebook example, I came across R's mutate function, which seems to be essentially the transform_columns that is currently implemented, except that it can transform multiple columns with different functions.
# Removes brackets from some columns
clean_df <- raw_df %>%
  # Removes brackets from the string columns 'producers' and 'genre'
  mutate(producers = str_remove(producers, "\\["),
         producers = str_remove(producers, "\\]"),
         genre = str_remove(genre, "\\["),
         genre = str_remove(genre, "\\]"))
The pandas documentation has a page comparing R and pandas functions, and it shows that mutate is most similar to df.assign (link).
Documentation for df.assign
Looking at the documentation for df.assign, it seems that, for the most part, it does what transform_columns does. The part that I'm confused about is that the return value is a new DataFrame with the modified or new columns.
@ericmjl Does the current implementation of transform_column return a new DataFrame with the modified columns, or does it return the original DataFrame with modified columns?
Lastly, for small datasets, a wrapper around df.assign with a col: df.str function mapping could help with #273. Here is a link to such an implementation. Not sure how this would affect performance for larger DataFrames, since it returns a new DataFrame.
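To make the return-value question concrete, here is a minimal sketch of df.assign on a made-up frame (the data and the use of str.strip are assumptions for illustration, not taken from the TidyTuesday notebook). It shows that assign hands back a new DataFrame and leaves the original untouched:

```python
import pandas as pd

# Made-up data that mimics bracketed string columns.
df = pd.DataFrame({"producers": ["[A]", "[B]"], "genre": ["[x]", "[y]"]})

# Each keyword names a column; each callable receives the DataFrame
# and returns the new values for that column.
cleaned = df.assign(
    producers=lambda d: d["producers"].str.strip("[]"),
    genre=lambda d: d["genre"].str.strip("[]"),
)

# `cleaned` is a separate object; `df` still holds the bracketed strings.
print(cleaned)
print(df)
```

Because the result is a new object, it would need to be reassigned (or chained) to have any effect, which is the same pattern pandas method chaining relies on.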
I like the idea of only having the plural version of the function. To still give it the same functionality as a "singular" function, there are a few options:
- Allow both column and columns arguments; column would take a string instead of a list and would work like transform_column.
- Only allow a columns argument, but if a user inputs a string instead of a list, check for it with something like the following:
if isinstance(columns, str):
    columns = [columns]
So, is it agreed then that there should only be the plural of each function? Keen to get the final stance on this and see if I can take it on.
@samukweku it's an important API decision. What's your thought?
I agree with the plural form of the functions. Fewer functions to keep track of, and the plural form still remains intuitive to the end user. Plural wins!
I will go with @szuckerman's option 2: just a single columns option, to which we can pass a string or a list of columns. In the backend we can split it into a list of columns if the string is a combination of columns.
Sounds good. This is a breaking change for the API, btw, so be sure to follow "deprecation practices". If I remember correctly, @hectormz put in a decorator for deprecating stuff. You might want to see if (a) it can be reused, or (b) a similar thing can be built for this situation (i.e. "deprecation of singular"). I'll probably need a warning to remind myself for a few versions, as I've gotten used to typing df.transform_column("column_name", function_name, "new_column_name").
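As a rough illustration of "deprecation of singular", the sketch below keeps a singular function as a thin wrapper that warns and delegates to the plural one. Both functions here are simplified stand-ins written for this example; they are not pyjanitor's actual implementations, and the existing deprecation decorator mentioned above is not used because its signature is not shown in this thread.

```python
import warnings

import pandas as pd


def transform_columns(df, columns, function):
    """Stand-in plural function: apply `function` to each listed column."""
    # Accept a bare string as shorthand for a one-element list.
    if isinstance(columns, str):
        columns = [columns]
    new_df = df.copy()
    for col in columns:
        new_df[col] = function(new_df[col])
    return new_df


def transform_column(df, column_name, function):
    """Stand-in singular function, kept only for backwards compatibility."""
    warnings.warn(
        "transform_column is deprecated; use transform_columns instead.",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller's line
    )
    return transform_columns(df, [column_name], function)
```

Calling transform_column(df, "col1", abs) would still work for a few versions, while nudging users toward the plural form.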
Actually, I think the API used in the function find_replace could be a good source of inspiration:
df.transform_columns(col1={"col1_log": np.log10, "col1_neg": lambda x: -x})
Or more generally:
df.transform_columns(column_name={"new_column_name": function})
What are your thoughts? Perhaps keeping transform_column around for legacy reasons (until we go 1.x) would be good, and encouraging more of the aforementioned pattern (column_name={new_name: func}) would be a nice direction for the API to evolve!
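A minimal sketch of how the column_name={new_name: func} pattern could be wired up with keyword arguments. The function name transform_columns_kwargs is hypothetical and exists only for this example; it is not part of pyjanitor.

```python
import numpy as np
import pandas as pd


def transform_columns_kwargs(df, **column_mappings):
    """Hypothetical helper: each kwarg is column_name={new_column_name: function}."""
    new_df = df.copy()
    for column_name, new_columns in column_mappings.items():
        for new_name, func in new_columns.items():
            # Always read from the original column, so one source column
            # can feed several new columns.
            new_df[new_name] = func(df[column_name])
    return new_df


df = pd.DataFrame({"col1": [1.0, 10.0, 100.0]})
out = transform_columns_kwargs(
    df, col1={"col1_log": np.log10, "col1_neg": lambda x: -x}
)
print(out)
```

One limitation of this design is that column names must be valid Python identifiers to be written as keyword arguments, and a column literally named df would clash with the first parameter; passing an explicit dictionary would avoid both issues.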
Bringing this up here as well. I think Pandas' transform function is sufficient and does pretty much all that transform column(s) does. I suggest we slate the function for deprecation, unless someone has some use cases that Pandas transform cannot pull off.
If I am reading the pandas docs correctly, I think transform only operates on the entire dataframe. Does it allow selective column transformations?
Yes, it does allow selective column transformations, although playing with it more, I see that you have to reassign the values to the dataframe:
df = pd.DataFrame(
    [
        ["Juan", 0, 0, 400, 450, 500],
        ["Luis", 100, 100, 100, 100, 100],
        ["Maria", 0, 20, 50, 300, 500],
        ["Laura", 0, 0, 0, 100, 900],
        ["Lina", 0, 0, 0, 0, 10],
    ],
    columns=["Name", "Date1", "Date2", "Date3", "Date4", "Date5"],
)
df
df.transform({'Date1': np.abs, 'Date3': lambda x: x + 1, 'Date4': np.sqrt})
Date1 | Date3 | Date4
0 | 401 | 21.213203
100 | 101 | 10.000000
0 | 51 | 17.320508
0 | 1 | 10.000000
0 | 1 | 0.000000
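The "you have to reassign" step might look like the sketch below, which uses a smaller made-up frame. It relies only on df.transform accepting a dictionary of column-to-function mappings, as in the example above, plus df.assign; it assumes the column names are strings.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Juan", "Luis"], "Date1": [0, -100], "Date4": [400.0, 100.0]}
)
funcs = {"Date1": np.abs, "Date4": np.sqrt}

# transform returns only the columns named in the dictionary ...
transformed = df.transform(funcs)

# ... so the results are written back over a copy of the original,
# which keeps "Name" and the original column order.
result = df.assign(**{col: transformed[col] for col in transformed.columns})
print(result)
```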
Cool stuff! Thanks @samukweku for figuring this out 😄 🎉!
With transform_column, it looks like we're covering the case where we want to selectively transform a bunch of columns while retaining anything that isn't explicitly passed in. This is a value-add on top of what df.transform does. We are also able to transform the column into a new column this way. The semantics of transform are such that the resulting dataframe columns are taken from the keys passed in, which I think might be a bit too restrictive for the broader use case that transform_columns solves.
I'd like to propose something along these lines then:
from typing import Dict

identity = lambda x: x

# Just a sketch, implementation needs to be fleshed out
def transform_columns(df, function_mapping: Dict, new_column_names: Dict):
    # First off, make a function mapping with "identity" as the default
    functions = {c: identity for c in df.columns}
    functions.update(function_mapping)
    # Next, transform the dataframe.
    new_df = df.transform(functions)
    # Next, rename the columns based on what's in new_column_names
    new_df = new_df.rename(new_column_names, axis="columns")
    # Finally, put back any columns from the original that are not intended to be overwritten.
    new_df = ...  # something... mind is blank on this
    return new_df
This is kind of a nice programming puzzle to figure out, I think. It might also help eliminate future "semantics debt", where the semantics of a function get muddied over time. The implementation is also flat ("flat is better than nested"), delegates to pandas what ought to be delegated to pandas, and should be more maintainable in the long-run.
The only downside is that this is an API change, so evolving the API would mean we should preserve the current API, add the new API as keyword arguments, add an API deprecation warning, mark the old chunk of code to be deprecated using comments, and only in the 1.x release remove the old code.
What do you all think?
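For discussion, here is one possible way the puzzle could be approached; it is a sketch under assumptions, not the implementation that was eventually settled on. It transforms only the requested columns, renames them, and then lays the results over the original frame with df.assign, so untouched columns and their order are preserved and renamed outputs are appended. The name transform_columns_sketch is hypothetical, and column names are assumed to be strings.

```python
import numpy as np
import pandas as pd


def negative(x):
    return -x


def transform_columns_sketch(df, function_mapping, new_column_names=None):
    """Hypothetical sketch: transform selected columns, keep everything else."""
    new_column_names = new_column_names or {}
    # Transform only the requested columns, then rename where asked.
    transformed = df.transform(function_mapping).rename(columns=new_column_names)
    # Overlay the results on the original: columns that kept their name are
    # overwritten in place, renamed ones are appended as new columns.
    return df.assign(**{name: transformed[name] for name in transformed.columns})


df = pd.DataFrame({"col1": [-1, 2], "col2": [3, -4], "col3": ["a", "b"]})
out = transform_columns_sketch(
    df,
    function_mapping={"col1": np.abs, "col2": negative},
    new_column_names={"col1": "col1_abs"},
)
print(out)
```

In this version the identity default is not needed, because columns that are not mentioned are never passed through transform in the first place.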
@ericmjl do you mind sharing an example, so I can better understand your proposal? I'm confused about identity and the function mapping.
Yep, definitely @samukweku!
I posted something here. Please let me know what you think of it!
cc: @sallyhong
Thanks for sharing the example! Helps a lot! (Quick note: in Cell 6, the column rename didn't work. I think you have to put 'date4', with no caps.)
Just wanted to confirm, for the second method style, how would you "transform" multiple columns when you want to rename some columns but not the other? (That's the main reason I like the first method style.)
I also think we should make sure that we preserve the column order of the original dataframe in the output. (Cell 8 output shows otherwise.)
As a side note, what if we used a new name mutate/transmute (nothing original in those names), that way we avoid the single/plural verb names, while still using a synonym that captures what we mean?
Just wanted to confirm, for the second method style, how would you "transform" multiple columns when you want to rename some columns but not the other? (That's the main reason I like the first method style.) I also think we should make sure that we preserve the column order of the original dataframe in the output. (Cell 8 output shows otherwise.)
@sallyhong I tried an alternative implementation, leveraging another function we have in the library. Check it out on this gist.
As a side note, what if we used a new name mutate/transmute (nothing original in those names), that way we avoid the single/plural verb names, while still using a synonym that captures what we mean?
@samukweku that's a cool idea. I think we might be hitting on the semantics of R's mutate.
What do both of you think about the 2nd implementation (transform_column_v3) in the gist? @sallyhong, would you like to give it a shot, sort of as a way of getting back into contributing? 😄
Hi, just assigned this to myself. I'll take a shot at this, this weekend! I want to study the R equivalent too to glean any other ideas.
Thanks!
Hi!
I attempted a v4.
I put some scenarios at the bottom. Please take a look and I'd love to hear feedback from anyone (about anything!) Thanks :)
@sallyhong I like the ideas. Just thinking about the append_new argument; it makes the transform_columns function more like an assign function. My thought is that the transform function should just transform (mutate) the columns, and not create new columns. It also seems like a good option. Just my opinion, others might have a different opinion.
Very good point @samukweku. I was wrestling with this, as R has two functions (mutate vs mutate_at) depending on whether you want to create a new column or overwrite the old column.
I decided that I didn't like the idea of changing a column's contents without knowing that I did (e.g. providing a transform_dict with append_new=False), and wanted the default to clearly indicate which columns were changed.
Or maybe we can just raise a warning if a user changes a column without a name and suggest they use df.assign for those columns instead? Not sure, but I'm not too attached to my code so I'm happy to change/update it :)
@sallyhong I like what you've done there! I am also wondering, what if the default suffix was not a fixed label, but instead the __name__ attribute of the function passed in? This would be magical :tada: for users. For anonymous functions such as lambdas (whose __name__ is just "<lambda>"), or callables that have no __name__ attribute at all, _new can be the baseline default.
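A small sketch of how the suffix lookup might work. The helper name default_suffix and the "new" fallback are assumptions made for this example; add_two mirrors the test function mentioned later in the thread.

```python
import functools


def default_suffix(func, fallback="new"):
    """Hypothetical helper: derive a column suffix from a function's name."""
    # functools.partial objects, for example, have no __name__ attribute.
    name = getattr(func, "__name__", None)
    # Lambdas do have a __name__, but it is the uninformative "<lambda>".
    if name is None or name == "<lambda>":
        return fallback
    return name


def add_two(x):
    return x + 2


print(default_suffix(add_two))                     # add_two
print(default_suffix(lambda x: x + 2))             # new
print(default_suffix(functools.partial(add_two)))  # new
```

A new column name could then be built as f"{column}_{default_suffix(func)}", giving names like col1_add_two.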
Thanks @ericmjl. I never knew of the __name__ attribute. Let me do some research and learn about this!
@ericmjl , great idea! I tried it out with a test function too (named "add_two")
https://gist.github.com/sallyhong/627db0810ee917df0f978b8557245a61