pyjanitor icon indicating copy to clipboard operation
pyjanitor copied to clipboard

singular/plural function names and intermediate copies

Open VPerrollaz opened this issue 4 years ago • 7 comments

Brief Description

Not clear if it is really a bug but still a bit weird.

  • in functions.py we have add_column and add_columns. The first one provides a new dataframe, but the second one delegates to the first for each particular column, so produces many intermediates (and I believe unwanted) copies.
  • in functions.py we also have transform_column and transform_columns which I believe have the same problem. Furthermore the docstrings seem to be misleading because transform_column uses DataFrame.assign which produces a new dataframe and transform_columns delegates to the singular version. But the docstrings implies/asserts that they mutate the original dataframe.

VPerrollaz avatar Jun 28 '20 13:06 VPerrollaz

@VPerrollaz thank you for the catch! As it so happens, @samukweku pinged up on a separate issue that also mentioned this (#284).

I think the API can be evolved/unified a bit, as follows:

df.transform_columns(column_name={"new_column_name": function})

Same goes for add_columns:

df.add_columns(column_name=<iterable/value>)

(For add_columns, though, we'd simply be wrapping df.assign, which already does a tremendous job with what we need.)

What are your thoughts here?

Finally, the issue of mutating vs. non-mutating: @zbarry has had use cases where mutating the dataframe (and not producing copies) is a good thing (e.g. large dataframes). I haven't had a good story on how to let an end-user distinguish these easily. Actually, I'm wondering, do you have any ideas here? I'd love to hear them.

ericmjl avatar Jun 28 '20 13:06 ericmjl

A few remarks here.

  • First of all I'm sorry I just checked the issue titles so I missed the previous mentions.
  • It was a bit unclear but I tried to avoid commenting on the API since I see reasonable arguments for very different choices at this point. On one side breaking changes are not really good and for one column case the plural form is a bit heavy, on the other the namespace of pandas is already a mess and one goal for janitor might be to unclutter the API (though at this point the we use the same namespace so the argument is a bit moot). One way to go at it might be to look at a reasonably large database of janitor use to check the statistics of utilization. I know there are ways to crawl github for uses of janitor but I'm not clear on how difficult it is.
  • I actually just wanted to point out that the plural version could mutate the internal steps while still not touching the original dataframe, which means not just delegating to the singular functions.
  • For the mutating part, my point was just that while transform_column/transform_columns claim to be mutating in the docstring they are actually not, so we might have to adjust the function or the docstring.
  • As far as "ostensible" mutation is concerned it seems the "optimal" way would be for a copy to occur only on the first step of the method chaining, however at this point all the technical ways I thought to achieve that seem incredibly dangerous. I agree that transparency/consistency for the user is critical here.

VPerrollaz avatar Jun 28 '20 14:06 VPerrollaz

First of all I'm sorry I just checked the issue titles so I missed the previous mentions.

@VPerrollaz no worries at all! :smile: Issues evolve, so it's tough to keep track of them via titles only. In any case, they're now linked.

I actually just wanted to point out that the plural version could mutate the internal steps while still not touching the original dataframe, which means not just delegating to the singular functions.

Thanks for re-clarifying! I definitely missed this point. The function call structure can definitely be improved, though I also think that we might be able to circumvent the problem by evolving the API a bit and deprecating the old behaviour.

Was wondering, would you be open to commenting on the proposed API design above? Do you find it a bit easier to reason about? Or is it more confusing? (I know it follows the groupby semantics a bit more closely.)

ericmjl avatar Jun 28 '20 17:06 ericmjl

  • I feel that for add_columns what you propose is very good and pretty close to what we already have, the only "problem" I have is the fact that the **kwargs is pretty unhelpful with autocompletion (I feel a bit like a monomaniac writing that :-) )
  • I have more mixed feeling for transform_columns, while the singular is non ambiguous, the plural one feels more difficult to me (in general not so much the API you propose) because the very general scenario corresponding to the name would be the following one for me
df.transform_columns(
    new_col1={(old_col_1_1, old_col_1_2,...old_col_1_n1): function_1}, 
    new_col2={(old_col_2_1, old_col_2_2,...old_col_2_n1): function_2}, 
    ..., 
    new_colm={(old_col_m_1, old_col_m_2,...old_col_m_nm): function_m}
)

where function_1 takes n1 columns as arguments, ..., and function_m takes nm columns. This starts to be much more complicated though and of course it degenerates in two subscenario:

  • n old columns, n functions taking each one column as argument thus n new columns
  • n old columns, one function taking n columns as arguments thus one new column.

VPerrollaz avatar Jun 28 '20 20:06 VPerrollaz

I see! Observing the general case API that you've written above, I see where we differ now. You would put new_column as the kwarg, while I would put existing_column in as the kwarg. Indeed, things get fancy/complicated if we try to "matrixify" the entire piece.

I think the 80% use case is "we just have a few columns we need to transform".

Alrighty, some random thoughts penned down. Let's let the idea simmer for a bit.

ericmjl avatar Jun 28 '20 22:06 ericmjl

I believe we can deprecate transform_column and transform_columns functions, as the transform function within pandas covers most use cases. Not seen any use cases that requires us to keep it. The transform function in Pandas is method chainable as well.

samukweku avatar Aug 15 '20 12:08 samukweku

changed my mind! transform_columns has its place; we probably have to agree on keeping just one?

samukweku avatar Dec 11 '21 10:12 samukweku