pyjanitor
Inconsistencies in original-dataframe mutation
Original issue, re: reorder_columns: it does not mutate the original DataFrame. I'm thinking about modifying it to do so, to be consistent with everything else I implemented.
Edit:
In working on the Jupyter Notebook example walkthrough for pyjanitor, I'm noticing some inconsistencies regarding whether the original DataFrame is changed after an operation in the provided example. My notes:
- .clean_names() does not mutate
- .remove_empty() does
- .rename_column() does not
- .coalesce() does not
- .encode_categorical() does
- .convert_excel_date() does
What do we think about this?
@ericmjl @szuckerman - I updated this post and would like your thoughts.
Related to #76
@zbarry I'm glad you brought this up, this has also been on my mind for a while.
I'm in favour of standardizing, but like the inplace=True/False issue, it means we may end up having to place a burden on new contributors to ensure that their functions conform to what we decide now. Both standardizing on mutating the original and not mutating it will result in technical debt; we just have to decide which one we want more. Unlike the inplace=True issue, this one relates to the performance of pyjanitor rather than the public-facing API.
Let me write down some of my thoughts at this point.
As we all probably know, standardizing on not mutating the original dataframe probably means that we will likely incur a memory and performance cost for large dataframes. Every function would then have to explicitly copy the original dataframe and only make modifications on the modified one.
Standardizing on mutating the original dataframe might be technically more long-winded, which makes development just a teeny bit more difficult. For example, I'm not quite sure how to implement .coalesce() in a way that mutates the original. Maybe I'm not well-versed in the pandas API.
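For what it's worth, here is a rough, untested sketch of one way an in-place coalesce might look (the signature here is just illustrative):

def coalesce_inplace(df, column_names, new_column_name):
    # Row-wise, take the first non-null value across the given columns.
    df[new_column_name] = df[column_names].bfill(axis=1).iloc[:, 0]
    # Drop the now-redundant source columns without copying the frame.
    df.drop(
        columns=[c for c in column_names if c != new_column_name],
        inplace=True,
    )
    return df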
In terms of what users will notice, speed is definitely a thing. However, I think of the pandas userbase as being Pareto-distributed, with 20% dealing with @zbarry's scale of data. The other 80% deal with in-memory data, for which performance is rarely going to be an issue; the pandas API would be the main blocker, not speed of code.
@szuckerman, do let us know what you think here. With 3 of us, we won't need a tiebreaker vote. I'm happy to go either way, though my current inclination leans towards a lazier form of development.
Happy to pick up the slack re: technical debt ;)
Edit: also, my data is all in memory as well. Speed is definitely a problem in this regime.
@ericmjl @szuckerman
Alright, I had some inspiration last night and think I might have come up with a user-friendly, explicit approach to operating in place when working with pyjanitor functions.
To somewhat sum up the previous discussion:
- It's really confusing that some janitor functions mutate the original dataframe, and others do not
- For contributors, a low barrier to entry for implementing their desired functionality is beneficial. If they don't think in-place operation is necessary for speed / efficiency, we have to consider the extra burden that enforcing in-place-by-default would impose on them
- For high-performance applications, dataframe copying is a big performance problem
- In my own experience, I've accidentally mutated dataframes I did not mean to (see the snippet just after this list). "Explicit is better than implicit" is not happening here.
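To illustrate the kind of accident I mean (using .encode_categorical(), which currently mutates):

backup = df  # no copy is made; both names point to the same object
df.encode_categorical(["subject"])  # backup's data silently changed too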
Proposed solution
The pyjanitor modus operandi would be as follows:
- By default, no pyjanitor functions touch the original dataframe
- Users must specify that they want pyjanitor operations on a dataframe to happen in place
- In-place operations are meant for computational efficiency only. There is no guarantee that a method chain has returned the original dataframe; this lack of guarantee removes a bit of technical debt, since we aren't forced to make every function capable of operating in place.
So how can this be done in a clean way?
Implementation
Two functions are implemented: a decorator, mutates_dataframe, and a pandas_flavor (pf)-registered method, operate_inplace. The former marks a pyjanitor function as one that modifies the data in the original dataframe; any function that modifies data must be wrapped in the decorator. The decorator's job is to check whether the user has asked for in-place operation. If the user has not, yet the given pyjanitor function does operate on the original data, the decorator copies the dataframe before piping it into the actual implementation. This enforces that the original data is never touched by pyjanitor unless the user very explicitly wants it to be.
The decorator is implemented as follows:
from functools import wraps


def mutates_dataframe(func):
    """
    A decorator for functions that mutate dataframes.

    In the case where in-place operations are not desired,
    this decorator makes a copy of the dataframe before
    executing your desired pyjanitor function.
    """

    @wraps(func)
    def wrapper(df, *args, **kwargs):
        try:
            if not df._janitor_operate_inplace:
                df = df.copy()
        except AttributeError:
            # The flag was never set, so default to copying.
            df = df.copy()
        return func(df, *args, **kwargs)

    return wrapper
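For example, registration and the decorator would compose like this (this remove_empty is just a sketch of a registered, data-modifying function; the real implementation may differ):

import pandas_flavor as pf


@pf.register_dataframe_method
@mutates_dataframe
def remove_empty(df):
    # Under the decorator, df is already a copy unless the user
    # explicitly opted in to in-place operation.
    df.dropna(how='all', axis=0, inplace=True)
    df.dropna(how='all', axis=1, inplace=True)
    return df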
The operate_inplace method will be called by the user as the first step in a chain when they want pyjanitor to use inplace=True (or equivalent) as much as possible. Again, there is no guarantee that copying never happens. operate_inplace takes an optional kwarg, make_copy (currently False by default), which is a one-call version of df.copy().operate_inplace() for users who want the computational benefits of in-place operations without mutating the original data. Since pyjanitor does not mutate the original dataframe by default, it might actually make more sense to keep with this paradigm when operating in place, and have the user again explicitly specify that they don't care if the original data is modified (for an additional memory / performance advantage).
operate_inplace is implemented as:
import pandas_flavor as pf


@pf.register_dataframe_method
def operate_inplace(df, make_copy=False):
    if make_copy:
        # Work on a copy, so the original dataframe stays untouched.
        df = df.copy()
    # Set the flag that the mutates_dataframe decorator checks.
    df._janitor_operate_inplace = True
    return df
All this function does, besides the optional copy, is set the flag that the decorator checks to decide whether it should copy the dataframe before calling the actual function implementation.
Now, given the pyjanitor example pipeline, the following is guaranteed to not modify df:
df = pd.read_excel('examples/dirty_data.xlsx')

cleaned_df = (
    df
    .clean_names()
    .remove_empty()
    .rename_column("%_allocated", "percent_allocated")
    .rename_column("full_time_", "full_time")
    .coalesce(["certification", "certification_1"], "certification")
    .encode_categorical(["subject", "employee_status", "full_time"])
    .convert_excel_date("hire_date")
    .reset_index_inplace(drop=True)
)
For increased speed, the user can now explicitly opt in to in-place ops wherever possible:
df = pd.read_excel('examples/dirty_data.xlsx')

cleaned_df = (
    df
    .operate_inplace()
    .clean_names()
    .remove_empty()
    .rename_column("%_allocated", "percent_allocated")
    .rename_column("full_time_", "full_time")
    .coalesce(["certification", "certification_1"], "certification")
    .encode_categorical(["subject", "employee_status", "full_time"])
    .convert_excel_date("hire_date")
    .reset_index_inplace(drop=True)
)
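And for the middle ground described earlier, the make_copy kwarg copies the dataframe exactly once up front, then operates in place on that copy, so df is again left alone:

df = pd.read_excel('examples/dirty_data.xlsx')

cleaned_df = (
    df
    .operate_inplace(make_copy=True)
    .clean_names()
    .remove_empty()
)

No matter how many functions are chained after operate_inplace(make_copy=True), only that single copy is ever made.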
Demo
Find my notebook demonstration of the concept here, where I take the demo from the pyjanitor notebook walkthrough and modify all the pyjanitor functions to have an in-place mode (I don't import janitor there; I just copy the necessary functions into the notebook).
There's a ton of good ideas here, and I really like the .operate_inplace() solution.
Just my own 2 cents:
I think that everything should mutate inplace by default, but also return the DataFrame.
This way, the following code snippets are similar:
df = (df.clean_names()
        .remove_empty())

df.clean_names()
df.remove_empty()
I personally find the latter more readable, but understand how most code is written in the former way.
If people do not want to mutate their DataFrames inplace, they could just do this, similar to the .operate_inplace() suggestion:
df = (df.copy()
        .clean_names()
        .remove_empty())

df2 = df.copy()
df2.clean_names()
df2.remove_empty()
I think it's important for pandas to have arguments controlling whether to mutate in place or not, since there are many situations where users are working with pandas interactively, and inadvertently messing up data in the middle of an analysis could be a problem.
I look at pyjanitor as being more like a SQL query. It's run at the beginning of an analysis and then one interactively performs EDA or other statistics on the data. In that case, we don't have to be as careful about returning in place or not. If someone messes their data up, they just run the pyjanitor cell again as they would with a SQL pull.
Also, there are some functions, like df.add_column('a', value), that don't make sense if they don't mutate in place. One would have to do df = df.add_column('a', value), which isn't much different from df['a'] = value. Just having df.add_column('a', value) is cleaner.
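A minimal sketch of that mutate-and-return pattern (this add_column is hypothetical, purely for illustration):

import pandas_flavor as pf


@pf.register_dataframe_method
def add_column(df, column_name, value):
    # Mutate the original frame in place...
    df[column_name] = value
    # ...but also return it, so the call still chains.
    return df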
Sorry for delay; on vacation.
Definitely agree that df.func() is a lot cleaner than df = df.func(). Something I'm not sure about, though, is whether it is even possible with pandas to make sure all the functions we (will) provide in pyjanitor mutate the original dataframe. In the case where a function cannot mutate the df, you now have an exception to the "df.func() instead of df = df.func()" rule that users will have to memorize (which was the originating issue behind this discussion).
As @jcmkk3 pointed out, inplace= is going away... Functions like clean_names, which rely on DataFrame.rename, will no longer be able to operate on the original dataframe object, further hampering any effort to guarantee that all janitor functions mutate it.
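To spell out the rename problem (illustrative only):

import pandas as pd

df = pd.DataFrame({'old': [1, 2]})
renamed = df.rename(columns={'old': 'new'})
assert renamed is not df      # rename handed back a brand-new object
assert 'old' in df.columns    # the original was not touched

# The closest in-place workaround rebinds the columns attribute directly:
df.columns = ['new']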
As a side note, if we do go with a function call to signal that copying should be avoided as much as possible, we might want to consider renaming the proposed operate_inplace method.
> I look at pyjanitor as being more like a SQL query. It's run at the beginning of an analysis and then one interactively performs EDA or other statistics on the data. In that case, we don't have to be as careful about returning in place or not. If someone messes their data up, they just run the pyjanitor cell again as they would with a SQL pull.
@szuckerman This has been my pattern of usage as well. I load in data, do some preprocessing, and then use the data in some other downstream modelling. The biggest win for me has been in interactively shaping the data until I get it right. Later on, when making presentations on how the data were preprocessed, I'm able to read off the data preprocessing steps pretty easily so I don't miss any key steps in my talks.