
Revisit Series implicit size mutability and implicit type conversions

Open wesm opened this issue 8 years ago • 6 comments

I haven't been able to wrap my head around these behaviors:

In [10]: s = pd.Series()

In [11]: s['a'] = 7

In [12]: s
Out[12]: 
a    7
dtype: int64

or

In [2]: s = pd.Series([1, 2, 3])

In [3]: s['4'] = 'b'

In [4]: s
Out[4]: 
0    1
1    2
2    3
4    b
dtype: object

On first principles, I think these should raise KeyError and ValueError/TypeError, respectively. I'm concerned that preserving these APIs (especially implicit type changes) is going to be problematic for pandas 2.0 if our goal is to provide more precise / consistent / explicit behavior, particularly with respect to data types. If you want to change the type of something, you should cast (unless there is a standard implicit cast, e.g. int -> float in arithmetic). In the case of implicit size mutation, it seems like a recipe for bugs to me (or simply working around coding anti-patterns -- better to explicitly reindex then assign values).
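To make the explicit alternatives concrete, here is a minimal sketch of "reindex, then assign" and "cast, then assign". It is an illustration only, not a proposed API; the variable names `grown` and `as_object` are made up for this example.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Explicit size change: reindex first, then assign into the new slot.
# The new label 3 is filled with NaN, so the result is float64.
grown = s.reindex([0, 1, 2, 3])
grown.loc[3] = 4.0

# Explicit type change: cast first, then assign a value of the new type.
as_object = s.astype(object)
as_object.loc[2] = 'b'
```

In both cases the original `s` is left untouched, and the change in size or dtype is visible at the call site.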

Let me know other thoughts about this.

wesm avatar Sep 27 '16 18:09 wesm

Personally, I'd let the type conversion on mutation go away completely, I don't see any upside to that, especially with a first class NA value that doesn't force casts.

More ambivalent on setting with expansion - sometimes that is useful (or at least convenient), e.g., below with a MultiIndex would be kind of annoying to express with a reindex? Granted this isn't very idiomatic and allowing this does add plenty of complexity to the indexing code.

In [27]: df = pd.DataFrame({'a': [1, 2, 3, 4]},
    ...:    index=pd.MultiIndex.from_product([['a', 'b'], ['k', 'd']]))

In [28]: idx = pd.IndexSlice

In [29]: df.loc[idx['a','j'], :] = 55

In [30]: df
Out[30]: 
        a
a k   1.0
  d   2.0
b k   3.0
  d   4.0
a j  55.0

chris-b1 avatar Sep 27 '16 18:09 chris-b1

I'm strongly supportive here -- this is a variation of #21.

If we disable automatic reindexing in __setitem__, for consistency I think we should do the same for __getitem__, too. We've been quite happy with only an explicit .reindex() method for doing reindexing in xarray.
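As a rough sketch of what "explicit only" looks like on the read side: in recent pandas releases (assuming 1.0 or later), selecting a list that contains a missing label raises instead of silently reindexing, and `.reindex()` is the way to ask for the missing label on purpose. The labels and the `raised` / `explicit` names below are made up for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# Selecting with a missing label: expected to raise KeyError on pandas >= 1.0
# rather than returning a reindexed result.
try:
    s.loc[['a', 'z']]
    raised = False
except KeyError:
    raised = True

# Asking for the missing label explicitly: 'z' comes back as NaN.
explicit = s.reindex(['a', 'z'])
```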

shoyer avatar Sep 27 '16 20:09 shoyer

@shoyer IIUC xarray is mostly immutable, so this is quite different than pandas, where the core objects are mutable by-default / definition. In any event I agree we have too much magic currently on implicit type conversions.

However, there is a special case that needs addressing. In particular, even though this is a discouraged pattern, it's used a lot, e.g.

In [3]: s = Series()
   ...: s['a'] = 2

In [4]: s
Out[4]: 
a    2
dtype: int64

So an empty Series really can't default to a dtype, and needs to be coerced on first fill. IOW, do we have a concept of .empty? This is further complicated by residual dtypes from removals.

So this is clear:

In [1]: s = Series([1,2,3])

In [2]: s[0:0]
Out[2]: Series([], dtype: int64)

But is this the same as Series()? Or does Series() have a dtype of 'empty' (or maybe 'any')? IOW, what happens if I now add an element to the result of [2]?

jreback avatar Sep 27 '16 22:09 jreback

> @shoyer IIUC xarray is mostly immutable, so this is quite different than pandas, where the core objects are mutable by-default / definition. In any event I agree we have too much magic currently on implicit type conversions.

Xarray is immutable in the same way proposed here for pandas: DataFrame columns can be mutated and existing rows can be mutated, but rows cannot be inserted or removed. This requires users to do things in a reasonably efficient way, because, like pandas, data is ultimately stored in non-resizeable arrays.

Implicit size mutability is problematic because a naive user might expect that data is backed by a data structure that can be efficiently resized, like Python's list, but this isn't the case (the .append method is problematic for similar reasons). Now that we are redoing the guts of pandas in C++, we could actually store data in resizeable arrays, but on the other hand that does entail some additional complexity and memory overhead.
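A small sketch of the two styles, assuming some hypothetical `records` to load. Growing label by label goes through setting-with-expansion on each iteration, which typically means building a new index and values array every time; collecting into a plain Python container first and constructing the Series once avoids that.

```python
import pandas as pd

records = [('a', 1), ('b', 2), ('c', 3)]  # hypothetical input data

# Style 1: grow the Series one label at a time (setting with expansion).
grown = pd.Series(dtype='int64')
for label, value in records:
    grown.loc[label] = value

# Style 2: collect in a dict, which is cheap to grow, then build once.
collected = {}
for label, value in records:
    collected[label] = value
built = pd.Series(collected, dtype='int64')
```

Both end up with the same labels and values; only the second states the dtype once and allocates the underlying array a single time.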

shoyer avatar Sep 27 '16 22:09 shoyer

The fact that this is used a lot tells me there is some API deficiency:

In [3]: s = Series()
   ...: s['a'] = 2

In [4]: s
Out[4]: 
a    2
dtype: int64

Series is not a dict and was never intended to be fully substitutable for one -- the lack of an empty method may be the culprit. We should look more at this.
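For comparison, this is roughly what the pattern looks like today if the labels and dtype are declared up front instead of being inferred on first fill. It is only a sketch using the existing constructor, not the hypothetical "empty" method mentioned above; the labels and fill value are made up.

```python
import pandas as pd

# Declare labels and dtype first; every slot starts at a known fill value.
s = pd.Series(0, index=['a', 'b'], dtype='int64')

# Assigning to an existing label changes neither the size nor the dtype.
s.loc['a'] = 2

# A zero-length slice keeps the dtype of the Series it came from.
empty_slice = pd.Series([1, 2, 3])[0:0]
```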

wesm avatar Sep 29 '16 13:09 wesm

related: https://github.com/pandas-dev/pandas/issues/9738

jreback avatar May 15 '17 10:05 jreback