pandas
pandas copied to clipboard
ENH/PERF: copy=True keyword in methods where copy=False can improve perf
xref #48117, #47993, #48043, #47934, #47932
We have a number of methods that currently make copies but don't have to, mainly methods that revolve around index/columns while leaving the data untouched. e.g. DataFrame.droplevel. By adding a keyword copy=True to these methods, we give users the choice to avoid making a copy by setting copy=False, thereby improving performance.
@jorisvandenbossche makes a reasonable point https://github.com/pandas-dev/pandas/pull/48117#pullrequestreview-1077970317 that adding this may interact with potential copy/view changes in upcoming versions.
Besides droplevel and drop, for which PRs are already open, the other methods I'm looking at are reset_index, replace, and fillna (there may be others I haven't found). In all three of those cases, the idea would be to add copy and deprecate inplace (xref #16529).
Should we revert https://github.com/pandas-dev/pandas/pull/48043, https://github.com/pandas-dev/pandas/pull/47934, https://github.com/pandas-dev/pandas/pull/47932 in the meantime?
On the PR I said:
And that change of behaviour is quite subtle (it's not something that we can easily deprecate about for those specific cases). I would really like to avoid that more people start to rely on the mutability of dataframes returned in such a
copy=Falsecase.
On second thought, this is not fully true. We could deprecate the copy=False behaviour, by just deprecating the full keyword (or specific option). But even then it is still the case that we are adding a new keyword that we would deprecate again shortly, which IMO also wouldn't be ideal.
ok i agree that this could be a can of worms and we have one change to do this right so will propose
- let's revert the changes adding copy= for 1.5
- let's inventory (create an issue) all methods and see where we have copy= and/or inplace= keywords
- after 2.0 i think we should deprecate inplace= and add copy=
We could deprecate the copy=False behaviour, by just deprecating the full keyword (or specific option).
I've been assuming that, since we have existing methods with copy=True keywords, that we will end up making a decorator to deprecate those if/when CoW makes them redundant. If so, the marginal cost (on the dev side) to deprecating extra methods would be low.
But even then it is still the case that we are adding a new keyword that we would deprecate again shortly, which IMO also wouldn't be ideal.
That's fair. My thought going in was a) we an offer improved performance now, while such a deprecation is both future and not-assured, b) all else equal, an explicit copy is a better API than an implicit one (CoW).
If we do add the keyword now(ish), we could also document that there is a decent chance of an upcoming change.
@jorisvandenbossche what if we did the following if/when CoW becomes the default: keep a copy keyword but have the default be lib.no_default, so copy=True makes a copy eagerly, not-specifying-copy defaults to CoW, and copy=False does something like result._mgr.refs = None before returning result? Would result be an actual view in that case? Would that Break The World for CoW?
what if we did the following if/when CoW becomes the default: keep a copy keyword but have the default be lib.no_default, so
copy=Truemakes a copy eagerly, not-specifying-copy defaults to CoW, and copy=False does something likeresult._mgr.refs = Nonebefore returningresult? Wouldresultbe an actual view in that case? Would that Break The World for CoW?
Yeah, based on my current understanding, that would break the CoW logic. Or at least it would introduce hard to predict behaviour. For example, you can set the refs of the resulting dataframe itself to None, but there might still be other objects referencing the same underlying data. So it is not a guarantee that no CoW happens (since we also check if the data is being referenced), or if we avoid CoW, it can break the guarantee of another object not getting mutated.
See also my comment here https://github.com/pandas-dev/pandas/issues/36195#issuecomment-830579242 about the shallow copy concept (i.e. copy=False or copy(deep=False)) not being possible anymore in its current meaning when using CoW.
We might want to add an advanced feature to actually do something like this, at some point if there is demand for that. But if we do so, I think it should be 1) new API (so you don't accidentally get this because you already did copy=False before), and 2) it might be better to expose this in methods that do the actual mutation (so you can say: I know what I am doing, for this mutation, do not do any CoW ref checks)
In general for the copy keyword, we still need to discuss what we actually want to do with this keyword. My current thinking was that we can indeed keep this, and have copy=True do an actual (eager) copy, while copy=False (the default) uses delayed CoW (that's what https://github.com/pandas-dev/pandas/pull/46958 did for the few methods where I already updated this, see https://github.com/pandas-dev/pandas/pull/46958#issuecomment-1130550445).
There can be some value in allowing users to do copy=True (although limited, since you can also use the copy() method afterwards if you want this). Alternatively, we could also use copy=<no_default> instead of copy=False as the default. Or we could also simply remove the copy keyword eventually.
If so, the marginal cost (on the dev side) to deprecating extra methods would be low.
On the dev side yes, but on the user side we would still be advising people to use copy=False now (eg from an inplace deprecation), which they potentially need to change again then, so in which case users would have to update their code twice.
all else equal, an explicit copy is a better API than an implicit one (CoW).
Yes, but that's for copy=True. If we are adding a new copy keyword now in certain methods, it is for users to do copy=False? (since the True case is already the current behaviour)
@jbrockmendel would you have time to revert https://github.com/pandas-dev/pandas/pull/48043, https://github.com/pandas-dev/pandas/pull/47934, https://github.com/pandas-dev/pandas/pull/47932? I was attempting to revert but got lost in all the merge conflicts as I think inplace might or might not supposed to be deprecated somewhere as well
I'll give it a go this afternoon
Yes, but that's for copy=True. If we are adding a new copy keyword now in certain methods, it is for users to do copy=False? (since the True case is already the current behaviour)
Correct, its to give users the option to get the more performant behavior (including internally, see a couple usages in #48117). In particular I have in mind use cases with method chaining where you might do something like df.foo(copy=False).bar(copy=False).baz(copy=False).[...].zoom(copy=True)
In general for the copy keyword, we still need to discuss what we actually want to do with this keyword. My current thinking was that we can indeed keep this, and have copy=True do an actual (eager) copy, while copy=False (the default) uses delayed CoW (that's what https://github.com/pandas-dev/pandas/pull/46958 did for the few methods where I already updated this, see https://github.com/pandas-dev/pandas/pull/46958#issuecomment-1130550445).
Yah this is a tough one. If/when CoW is enabled by default, do you expect there will still be an option to disable it via pd.options? If so, I expect we'd want to keep the copy kwd available for this mode.
i think we should wait to avoid confusion with CoW
then make a single change for 2.0 (or even wait until cow is default)
Opened #48201 to revert #48934. For the other two, we've also deprecated inplace, would need to revert those deprecations if reverting the copy kwd. I'm not wild about this, but will do it if there's consensus.
Sorry for the slow follow-up here.
For the other two, we've also deprecated inplace, would need to revert those deprecations if reverting the copy kwd. I'm not wild about this, but will do it if there's consensus.
I would personally also prefer to revert those, but we would need to indeed revert also the deprecation of inplace. The same arguments as discussed above apply to those methods.
I opened https://github.com/pandas-dev/pandas/pull/48417 for set_index, in case we agree on this path forward.
CoW becoming the default/only behavior seems increasingly likely. Closing.