
[ENH] Smarter use of dataframe copying

Open zbarry opened this issue 5 years ago • 3 comments

Right now, we copy entire dataframes to avoid mutating the original inside a pyjanitor function. This can of course come with a big computational cost, depending on the size of the dataframe involved. We've talked a bit about how to mitigate this problem, and for now we've gone with the "copy everything" option. The question that remains open, though, is what we are actually trying to preserve with the copy. In some cases, a pyjanitor function only acts on dataframe metadata, not the data contained within. In those cases, protecting the data with a copy might not be necessary at all.

I realized that df.copy() also supports a shallow copy via deep=False:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html

I think there is a good number of cases in pyjanitor where what we really want to guarantee is preservation of the structure of the input dataframe. The default copy strategy we've been using unlinks the data between the function's input and output dataframes. Is that really important when you're just cleaning the column names? .copy(deep=False) results in both dataframes pointing to the same underlying data and avoids the unnecessary copying.
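For concreteness, here is a rough sketch of the distinction (the column-cleaning step is just a hypothetical metadata-only operation, and the exact sharing semantics depend on the pandas version and copy-on-write settings):

import pandas as pd

df = pd.DataFrame({"A Column": [1, 2, 3], "B Column": [4, 5, 6]})

# Deep copy (the current behaviour in pyjanitor functions): duplicates the data.
deep = df.copy()  # deep=True is the default

# Shallow copy: a new DataFrame object whose values reference the same
# underlying blocks, so no per-element copying happens.
shallow = df.copy(deep=False)

# A metadata-only transformation such as renaming columns can then be done
# without paying for a full copy of the data.
shallow.columns = [c.lower().replace(" ", "_") for c in shallow.columns]

# Caveat: because the data is shared, in-place writes to the values of one
# frame may be visible in the other (behaviour changes once copy-on-write
# is enabled in newer pandas versions).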

zbarry avatar Oct 18 '19 20:10 zbarry

Hi, I look to this repo for examples of how to use pandas_flavor, and I've been reading your comments/posts about the inplace vs. SQL-query mentality. It's difficult to determine how/when to modify in place versus create a copy, and the behavior seems inconsistent. Does pandas_flavor always require copying/modifying the original dataframe? Is there a way to avoid this in order to save memory? I would simply like to do dataframe.drop_empty_rows() and affect the original dataframe.

import pandas_flavor

# Does not replace original dataframe
@pandas_flavor.register_dataframe_method
def drop_empty_rows(dataframe):
    return dataframe.dropna(axis=0, how='all')
 
# Should replace original dataframe, but doesn't in my tests.
@pandas_flavor.register_dataframe_method
def drop_empty_rows(dataframe):
    dataframe = dataframe.dropna(axis=0, how='all')
    return dataframe

# Should replace original dataframe -- untested.
@pandas_flavor.register_dataframe_method
def drop_empty_rows(dataframe):
    dataframe_processed = dataframe.copy()
    dataframe_processed = dataframe.dropna(axis=0, how='all')
    return dataframe_processed

DOH-Manada avatar Jun 21 '21 20:06 DOH-Manada

@DOH-Manada thanks for pinging in! Indeed, we've gone back and forth on this matter for close to two years now.

There are some intricacies to worry about here. An explicit .copy() of a dataframe may result in badly behaved code with respect to memory usage, especially when users have a large dataframe. I think @zbarry wanted .copy() off by default because in his use cases, enforcing a copy on every pyjanitor operation would blow up memory usage.

I think over time, this next idea has crystallized more clearly for me. It starts with making pyjanitor use pandas-native .something() operations wherever possible. In doing so, we never end up in a situation where we use the selector syntax (df[selector]) to directly mutate the dataframe. The added benefit is that we also end up leveraging all of pandas' memory management capabilities, so we never have to worry about it ourselves. In some senses, it's the software developer in me reminding myself to heed Bill Gates' advice - to be strategically lazy :smile:.
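To make that concrete, here is a minimal sketch (not pyjanitor's actual source; fill_zeros is just a hypothetical example) contrasting selector-style mutation with delegating to a pandas-native method:

import pandas as pd
import pandas_flavor as pf

# Mutating through the selector syntax touches the caller's dataframe directly.
def fill_zeros_inplace(df):
    df[df.isna()] = 0  # the original dataframe is modified
    return df

# Delegating to a pandas-native method returns a new frame and lets pandas
# decide what (if anything) actually needs to be copied under the hood.
@pf.register_dataframe_method
def fill_zeros(df):
    return df.fillna(0)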

So if I were in your position, the first code block in your comment is how I would prefer pyjanitor to implement drop_empty_rows.


As for your specific questions:

Does pandas_flavor always require a copy/modifying the original dataframe?

Not at all. pandas_flavor does one and only one thing: registering a function as a dataframe method. It doesn't care about how the internal implementation of a function looks.
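
In other words (a minimal, hypothetical sketch - shout_columns is not a real pyjanitor function), the decorator only attaches the function to DataFrame; whether the body copies, mutates, or does neither is entirely up to the function author:

import pandas as pd
import pandas_flavor as pf

@pf.register_dataframe_method
def shout_columns(df):
    # No copy here: this mutates the caller's dataframe in place.
    df.columns = [c.upper() for c in df.columns]
    return df

df = pd.DataFrame({"a": [1]})
df.shout_columns()  # the original df now has upper-cased columns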

Is there a way to avoid this in order to save memory?

As mentioned above, if we simply wrap pandas-native methods into higher-order functions/dataframe methods, then we take advantage of how pandas handles memory.


Finally, I think I did see a mistake in the code blocks - hope you don't mind me pointing it out here to discuss. I think the second example shouldn't replace the original dataframe, because when we do df = df.dropna(), we're only rebinding the local name df to a new object; the caller's original dataframe is left untouched. Definitely correct me if I'm wrong here.
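A quick way to convince yourself of this (a small sketch in plain pandas, independent of pandas_flavor):

import pandas as pd

df = pd.DataFrame({"a": [1.0, None]})
original = df              # a second name bound to the same object

df = df.dropna()           # rebinds the local name df to a *new* DataFrame
print(original.shape)      # (2, 1) - the original object is untouched
print(df.shape)            # (1, 1) - only the new object dropped the NaN row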

Also, in the third example, I think you intended to do:

# Should replace original dataframe -- untested.
@pandas_flavor.register_dataframe_method
def drop_empty_rows(dataframe):
    dataframe_processed = dataframe.copy()
    dataframe_processed = dataframe_processed.dropna(axis=0, how='all')
    return dataframe_processed

Please let me know if this addressed your question! I'd also encourage you to try out some memory profiling tools to see how memory usage changes with the different implementations. That would be the most definitive way to check memory usage.
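
As a starting point, here is a rough sketch using the standard library's tracemalloc to compare a deep copy against a shallow one (this assumes NumPy's allocations are visible to tracemalloc, which is the case for recent NumPy versions):

import tracemalloc

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1_000_000, 10))  # roughly 80 MB of float64 data

tracemalloc.start()
deep = df.copy()                       # deep copy duplicates the data
_, deep_peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

tracemalloc.start()
shallow = df.copy(deep=False)          # shallow copy shares the data
_, shallow_peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"deep copy peak allocation:    {deep_peak / 1e6:.1f} MB")
print(f"shallow copy peak allocation: {shallow_peak / 1e6:.1f} MB")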

ericmjl avatar Jun 22 '21 01:06 ericmjl

@ericmjl Thanks for the quick response. Yes, the third example had a typo (can't edit). However, in my second example, my intention was to overwrite the original dataframe, which I believe is similar to pyjanitor's implementation:

    """  # noqa: E501
    nanrows = df.index[df.isna().all(axis=1)]
    df = df.drop(index=nanrows).reset_index(drop=True)

    nancols = df.columns[df.isna().all(axis=0)]
    df = df.drop(columns=nancols)

    return df

Ultimately, I expected dataframe.perform_some_action() to modify the original dataframe in place, but if that's not best practice, I'll simply do dataframe = dataframe.perform_some_action(). Indeed, creating a copy each time definitely slowed things down on my end.

dhern023 avatar Jun 22 '21 12:06 dhern023