pyjanitor
pyjanitor copied to clipboard
singular/plural function names and intermediate copies
Brief Description
Not clear if it is really a bug but still a bit weird.
- in functions.py we have
add_column
andadd_columns
. The first one provides a new dataframe, but the second one delegates to the first for each particular column, so produces many intermediates (and I believe unwanted) copies. - in functions.py we also have
transform_column
andtransform_columns
which I believe have the same problem. Furthermore the docstrings seem to be misleading becausetransform_column
uses DataFrame.assign which produces a new dataframe andtransform_columns
delegates to the singular version. But the docstrings implies/asserts that they mutate the original dataframe.
@VPerrollaz thank you for the catch! As it so happens, @samukweku pinged up on a separate issue that also mentioned this (#284).
I think the API can be evolved/unified a bit, as follows:
df.transform_columns(column_name={"new_column_name": function})
Same goes for add_columns
:
df.add_columns(column_name=<iterable/value>)
(For add_columns, though, we'd simply be wrapping df.assign
, which already does a tremendous job with what we need.)
What are your thoughts here?
Finally, the issue of mutating vs. non-mutating: @zbarry has had use cases where mutating the dataframe (and not producing copies) is a good thing (e.g. large dataframes). I haven't had a good story on how to let an end-user distinguish these easily. Actually, I'm wondering, do you have any ideas here? I'd love to hear them.
A few remarks here.
- First of all I'm sorry I just checked the issue titles so I missed the previous mentions.
- It was a bit unclear but I tried to avoid commenting on the API since I see reasonable arguments for very different choices at this point. On one side breaking changes are not really good and for one column case the plural form is a bit heavy, on the other the namespace of pandas is already a mess and one goal for janitor might be to unclutter the API (though at this point the we use the same namespace so the argument is a bit moot). One way to go at it might be to look at a reasonably large database of janitor use to check the statistics of utilization. I know there are ways to crawl github for uses of janitor but I'm not clear on how difficult it is.
- I actually just wanted to point out that the plural version could mutate the internal steps while still not touching the original dataframe, which means not just delegating to the singular functions.
- For the mutating part, my point was just that while
transform_column
/transform_columns
claim to be mutating in the docstring they are actually not, so we might have to adjust the function or the docstring. - As far as "ostensible" mutation is concerned it seems the "optimal" way would be for a copy to occur only on the first step of the method chaining, however at this point all the technical ways I thought to achieve that seem incredibly dangerous. I agree that transparency/consistency for the user is critical here.
First of all I'm sorry I just checked the issue titles so I missed the previous mentions.
@VPerrollaz no worries at all! :smile: Issues evolve, so it's tough to keep track of them via titles only. In any case, they're now linked.
I actually just wanted to point out that the plural version could mutate the internal steps while still not touching the original dataframe, which means not just delegating to the singular functions.
Thanks for re-clarifying! I definitely missed this point. The function call structure can definitely be improved, though I also think that we might be able to circumvent the problem by evolving the API a bit and deprecating the old behaviour.
Was wondering, would you be open to commenting on the proposed API design above? Do you find it a bit easier to reason about? Or is it more confusing? (I know it follows the groupby semantics a bit more closely.)
- I feel that for
add_columns
what you propose is very good and pretty close to what we already have, the only "problem" I have is the fact that the**kwargs
is pretty unhelpful with autocompletion (I feel a bit like a monomaniac writing that :-) ) - I have more mixed feeling for
transform_columns
, while the singular is non ambiguous, the plural one feels more difficult to me (in general not so much the API you propose) because the very general scenario corresponding to the name would be the following one for me
df.transform_columns(
new_col1={(old_col_1_1, old_col_1_2,...old_col_1_n1): function_1},
new_col2={(old_col_2_1, old_col_2_2,...old_col_2_n1): function_2},
...,
new_colm={(old_col_m_1, old_col_m_2,...old_col_m_nm): function_m}
)
where function_1
takes n1
columns as arguments, ..., and function_m
takes nm
columns.
This starts to be much more complicated though and of course it degenerates in two subscenario:
-
n
old columns,n
functions taking each one column as argument thusn
new columns -
n
old columns, one function takingn
columns as arguments thus one new column.
I see! Observing the general case API that you've written above, I see where we differ now. You would put new_column
as the kwarg, while I would put existing_column
in as the kwarg. Indeed, things get fancy/complicated if we try to "matrixify" the entire piece.
I think the 80% use case is "we just have a few columns we need to transform".
Alrighty, some random thoughts penned down. Let's let the idea simmer for a bit.
I believe we can deprecate transform_column
and transform_columns
functions, as the transform
function within pandas covers most use cases. Not seen any use cases that requires us to keep it. The transform
function in Pandas is method chainable as well.
changed my mind! transform_columns
has its place; we probably have to agree on keeping just one?