API: reconsider returning read-only arrays from DataFrame/Series .array/.values/__array__

Open jorisvandenbossche opened this issue 1 month ago • 6 comments

Context: during the implementation of the Copy-on-Write feature (https://github.com/pandas-dev/pandas/issues/48998), there was the idea to make returned arrays read-only for APIs that return underlying arrays (.values, to_numpy(), __array__).

This was initially only done for numpy arrays (the first two PRs), and recently also for columns backed by ExtensionArrays, both when returning an EA (.values / .array) and when returning the EA as a numpy array (to_numpy(), __array__):

  • https://github.com/pandas-dev/pandas/pull/51082
  • https://github.com/pandas-dev/pandas/pull/53704
  • https://github.com/pandas-dev/pandas/pull/61925

The idea behind returning a read-only array is as follows: with Copy-on-Write, the guarantee we provide is that mutating one pandas object (Series, DataFrame) doesn't update another pandas object (whose data is shared as an implementation detail). But users can still easily get a numpy array that is a view on that data, and mutate it. At that point, we don't have any control over how this mutation propagates (it might update more objects than just the one from which the user obtained it, for example if other Series/DataFrames were sharing data with this object under CoW).

Example to illustrate this:

# creating a dataframe and a derived dataframe through some operation
# (that in this case didn't need to copy)
>>> df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> df2 = df.sort_values(by="a").reset_index()

# getting a column and mutating this -> CoW gets triggered and only `ser` is changed, not `df`
>>> ser = df["a"]
>>> ser[0] = 100
>>> ser
0    100
1      2
2      3
Name: a, dtype: int64
>>> df
   a  b
0  1  4
1  2  5
2  3  6

# however, when the code is mutating the numpy array it got from the series (or dataframe)
# (through .values, or np.asarray(ser), etc), then even the derived `df2` is silently mutated
>>> ser = df["a"]
>>> arr = ser.values
>>> arr.flags.writeable = True  # <-- this is now needed because we made .values readonly
>>> arr[0] = 100
>>> df2
   index    a  b
0      0  100  4
1      1    2  5
2      2    3  6

Right now, with returning read-only arrays, I have to include arr.flags.writeable = True to make this work (otherwise the above example would raise an error in arr[0] = 100 about the array being read-only).

But if we didn't make the returned arrays read-only, this would work, and such mutations of the underlying numpy array would propagate unpredictably to other pandas series/dataframe objects.
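The mechanism can be sketched with numpy alone, leaving pandas out entirely. This is a minimal illustration, assuming the shared data is a plain ndarray buffer; the names `owner`, `exposed`, and `safe` are hypothetical and do not correspond to pandas internals.

```python
import numpy as np

# Stand-in for a buffer owned by a pandas object (assumption: a plain ndarray).
owner = np.array([1, 2, 3])

# Hand out a view on the same memory, flagged read-only.
exposed = owner.view()
exposed.flags.writeable = False

# Writing through the read-only view raises ValueError.
try:
    exposed[0] = 100
    write_blocked = False
except ValueError:
    write_blocked = True

# Explicit opt-in: re-enable writes, after which the mutation
# reaches the owner's memory as well.
exposed.flags.writeable = True
exposed[0] = 100

# A copy is independent, so mutating it leaves the owner untouched.
safe = owner.copy()
safe[1] = -1
```

The read-only flag does not prevent mutation; it only forces the caller to opt in explicitly, or to take a copy when independent data is wanted.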

jorisvandenbossche avatar Nov 12 '25 18:11 jorisvandenbossche

Discussed in this week's dev meeting. The issue @jorisvandenbossche mentioned is in geopandas, where they do something like ser.array.attrname = value and find that, with the read-only behavior (in main but not in a released version), this does not propagate to the actual underlying array (ser._mgr.blocks[0].values). One solution to this particular problem would be to set the attribute on ser._values, but in general we don't want third parties messing with private attributes (we could make an exception for geopandas because they are special).
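A toy sketch of why an attribute assignment can get lost, assuming the accessor returns a fresh view object on every access. `ToyArray` and `ToySeries` are hypothetical stand-ins, not pandas or geopandas classes, and `attrname` is the placeholder name used above.

```python
class ToyArray:
    """Hypothetical array-like that carries mutable attribute state."""

    def __init__(self, data, attrname=None):
        self._data = data          # shared underlying storage
        self.attrname = attrname   # per-object state

    def view(self):
        # New wrapper object around the same storage; attributes are copied over.
        return ToyArray(self._data, attrname=self.attrname)


class ToySeries:
    """Hypothetical container whose .array hands out a view each time."""

    def __init__(self, values):
        self._values = values

    @property
    def array(self):
        return self._values.view()


ser = ToySeries(ToyArray([1, 2, 3], attrname="old"))

# The assignment lands on a temporary view object, not on the stored array.
ser.array.attrname = "new"

stored_value = ser._values.attrname
next_access_value = ser.array.attrname
```

In this sketch both `stored_value` and `next_access_value` remain "old", because each access to `.array` builds a new wrapper while the stored object is never touched.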

I have always been skeptical of making .values/.array/__array__ a read-only view so am fine with reverting. No one else in the meeting expressed an opinion.

The path that I'm now advocating is to revert for now and see if people complain about the lack of read-only-ness. If they do, we can quickly do a 3.0.1. Meeting participants reacted non-verbally to this suggestion in a way that I interpreted as loose agreement.

jbrockmendel avatar Nov 14 '25 17:11 jbrockmendel

@jbrockmendel Thanks for the explanation. If this is ready to move forward, I’d be happy to take a look and open a PR to revert the read-only change as you suggested. I may have a few questions if I get stuck along the way; I hope that’s okay. Please let me know if I should proceed.

Aniketsy avatar Nov 21 '25 07:11 Aniketsy

Not until there's consensus that this change is worth reverting.

jbrockmendel avatar Nov 21 '25 16:11 jbrockmendel

(updated the top post with more context on the original reason for doing this)

I was thinking we could also do a more partial revert, i.e. for now just revert for the cases where we return an EA, but keep read-only for numpy arrays (since those don't really have mutable state one might update). And we can still do a fuller revert if we get too many complaints about the read-only numpy arrays.

Of course, that creates some inconsistency between EA and np.ndarray (or between the ndarray you get from an EA-backed series and the one you get from a numpy-dtype-backed series; this one is probably the more annoying inconsistency).

jorisvandenbossche avatar Nov 26 '25 17:11 jorisvandenbossche

While chatting with @rhshadrach, I had another thought on how to solve (longer term) the issue that geopandas is facing, see https://github.com/pandas-dev/pandas/issues/63215 (for short-term back compat, reverting the returned EA being a view would still be good).

jorisvandenbossche avatar Nov 26 '25 22:11 jorisvandenbossche

@jorisvandenbossche - with #63212, close this or at least take it off the 3.0 milestone?

rhshadrach avatar Dec 02 '25 22:12 rhshadrach