pandas2
Copy on write for views
I will work on a full document for this to get the conversation started, but this can be the placeholder for our discussion about copy-on-write (COW).
As nice as copy-on-write would be, it's not strictly necessary in pandas 2.0 because we can choose our own consistent rules for copying once we divorce our storage from NumPy.
For example, we could say:
1. Any indexing operation on columns uses views.
2. Any indexing operation on rows makes a copy (all indexing operations on Series make a copy).
Given that we plan to ditch the BlockManager anyways, we would get (1) basically for free.
I'm sure there are a few use cases for view based slicing of DataFrame rows, but these are quite niche in comparison to selecting columns, and in my opinion, the unpredictability it introduces into the data model is not worth the trouble.
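To make the two proposed rules concrete, here is a rough sketch in plain Python. `ToyFrame` and its methods are hypothetical names made up for illustration, not pandas API, and it assumes each column lives in its own buffer (a plain list here), which is roughly what dropping the BlockManager would give us.

```python
class ToyFrame:
    """Hypothetical column store: one Python list per column (not pandas API)."""

    def __init__(self, columns):
        # name -> list; the lists themselves are not copied here
        self._columns = dict(columns)

    def select_columns(self, names):
        # Rule 1: column indexing returns a view (the same underlying lists).
        return ToyFrame({name: self._columns[name] for name in names})

    def select_rows(self, rows):
        # Rule 2: row indexing always copies (slicing a list makes a new list).
        return ToyFrame(
            {name: values[rows] for name, values in self._columns.items()}
        )

    def set_value(self, name, row, value):
        self._columns[name][row] = value

    def get_value(self, name, row):
        return self._columns[name][row]


frame = ToyFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
col_view = frame.select_columns(["a"])     # shares storage with frame
row_copy = frame.select_rows(slice(0, 2))  # independent storage

frame.set_value("a", 0, 100)
view_sees_change = col_view.get_value("a", 0) == 100
copy_unchanged = row_copy.get_value("a", 0) == 1
```

The point is only that the outcome depends on which axis was indexed, never on dtype layout or on how the frame happened to be built.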
Copy on write for column views (and eventually, maybe row slicing) would still be nice in making pandas more intuitive, but could possibly wait until a later 2.1 or 3.0 release (supposing we're doing semantic versioning).
I agree COW isn't a strict necessity for the 1 -> 2 transition. I think it's worth keeping in mind during the development process, as there are a number of things we can do that would make adding it later easier or more difficult. Step 1 is keeping track of parent-child relationships in a lightweight way; to start, we can keep permitting mutation in accordance with current behavior.
See discussion in https://github.com/pydata/pandas/pull/11500
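As one possible shape for the lightweight tracking mentioned above, here is a minimal sketch. `CowColumn` is a hypothetical name, and using a `weakref.WeakSet` of the objects sharing a buffer is my assumption, not a decided design. It also shows how copy-on-write could later hang off the same bookkeeping; skipping the copy in `__setitem__` would give the current mutate-through behavior.

```python
import weakref


class CowColumn:
    """Hypothetical column that knows which other columns share its buffer."""

    def __init__(self, values, _sharers=None):
        self._values = values
        # Weak references only, so tracking never keeps a view alive.
        self._sharers = _sharers if _sharers is not None else weakref.WeakSet()
        self._sharers.add(self)

    def view(self):
        # A view reuses the buffer and joins the same set of sharers.
        return CowColumn(self._values, self._sharers)

    def __getitem__(self, i):
        return self._values[i]

    def __setitem__(self, i, value):
        if len(self._sharers) > 1:
            # Someone else still sees this buffer: detach and copy first.
            self._sharers.discard(self)
            self._values = list(self._values)
            self._sharers = weakref.WeakSet([self])
        self._values[i] = value

    def shares_data_with(self, other):
        return self._values is other._values


parent = CowColumn([1, 2, 3])
child = parent.view()
shared_before = child.shares_data_with(parent)

child[0] = 99  # triggers the copy; parent is left untouched
shared_after = child.shares_data_with(parent)
```

After the write, `parent` is the only remaining sharer of the original buffer, so later writes to it go through without any copy.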
I've expressed my views on COW in pretty extensive detail elsewhere (#10954), so I'll save everyone the trouble of repeating them all here, but in short: any behavior that's consistent and easy to understand is fine by me!
Have we abandoned trying to get this in before v1.0?
It's probably not too likely, since it would be an API change whose impact would take a little time to fully understand. If anyone has other thoughts (separate from the behavior of C-O-W) on this, please chime in.
A notable benefit of copy-on-write is that operations like reset_index become zero-copy operations.
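A toy illustration of why that holds, with hypothetical names (`IndexedFrame` is not pandas API): the result can simply reuse the existing buffers. The sketch deliberately leaves out the copy-on-write machinery itself; sharing like this is only safe if a later write through either frame copies the affected buffer first.

```python
class IndexedFrame:
    """Hypothetical frame: an index buffer plus one list per column."""

    def __init__(self, index, columns):
        self.index = index
        self.columns = columns

    def reset_index(self, name="index"):
        # Reuse every existing buffer: the old index becomes an ordinary
        # column, and the new default index is a lazy range.
        columns = {name: self.index, **self.columns}
        return IndexedFrame(range(len(self.index)), columns)


frame = IndexedFrame(["x", "y", "z"], {"a": [1, 2, 3]})
flat = frame.reset_index()

index_reused = flat.columns["index"] is frame.index
column_reused = flat.columns["a"] is frame.columns["a"]
```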