pandas2
Optional indexes
The pandas.Index is fantastically useful, but in many cases pandas's insistence on always having an index gets in the way.
Usually, it can be safely ignored when not relevant, especially now that we have RangeIndex (which makes the cost of creating the index minimal), but this is not always the case:
- The indexing and join behavior of the default RangeIndex is actively harmful. It would be better to raise an error when implicitly joining on an index between two datasets with a default index.
- When converting a DataFrame into other formats, we need an argument (e.g., index=True) for controlling whether or not to include the index.
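To make both complaints concrete, here is a minimal sketch using current pandas. The frame and column names are made up for illustration.

```python
import pandas as pd

# Two frames that both carry the default RangeIndex (labels 0, 1, 2).
left = pd.DataFrame({"x": [1, 2, 3]})
right = pd.DataFrame({"x": [10, 20, 30]})

# Filtering keeps the original labels, so `filtered` is now labeled 1, 2.
filtered = left[left["x"] > 1]

# Arithmetic silently aligns on those labels instead of raising:
# label 0 has no partner in `filtered` and becomes NaN.
total = filtered["x"] + right["x"]

# Exporting needs an explicit opt-out to leave the index out of the output.
csv_text = left.to_csv(index=False)
```

Nothing here raises an error even though the user never asked for label alignment, which is the behavior the first bullet objects to.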
I propose that we make the index optional, e.g., by allowing it to be set to None. This entails a need for some rules to handle missing indexes:
- Operations that explicitly rely on indexes (e.g., .loc and join) should raise TypeError when called on objects without an index.
- Operations that implicitly rely on indexes for alignment (e.g., the DataFrame constructor and arithmetic) now need to handle three cases:
  - Index/index operations: These work as before. The result's index is the outer join of the input indexes.
  - No-index/no-index operations: The inputs must have exactly the same length (otherwise TypeError is raised). The result has no index.
  - Mixed index/no-index operations: The inputs must have the same length. The result takes on the index from the input with an index.
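The three alignment cases could be sketched roughly as follows. The helper `result_index` is hypothetical and not part of pandas, a missing index is modeled as `None`, and raising TypeError for a length mismatch in the mixed case is an assumption (the proposal only states that for the no-index/no-index case).

```python
from typing import Optional

import pandas as pd


def result_index(left: Optional[pd.Index], n_left: int,
                 right: Optional[pd.Index], n_right: int) -> Optional[pd.Index]:
    """Hypothetical helper: choose the result index of a binary operation."""
    if left is not None and right is not None:
        # Index/index: outer join of the labels, as pandas does today.
        return left.union(right)
    if n_left != n_right:
        # Any operand without an index forces positional matching,
        # so the lengths have to agree (assumed to raise TypeError).
        raise TypeError(f"length mismatch without an index: {n_left} != {n_right}")
    # Mixed: inherit the one index that exists.
    # No-index/no-index: both are None, so the result has no index.
    return left if left is not None else right


both = result_index(pd.Index(["a", "b"]), 2, pd.Index(["b", "c"]), 2)
neither = result_index(None, 3, None, 3)
mixed = result_index(pd.Index(["a", "b"]), 2, None, 2)
```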
Somewhat related: #15
+1, although I think the opposite approach may also be worth consideration.
What if, instead of being a special property of a DataFrame, an "Index" is just defined by a selection of columns in the frame:
- 0 (your None)
- 1 (today's Index)
- or more (something like a MultiIndex)
This would remove the Index / column distinction, which I think is a stumbling block for many. Some discussion here: https://github.com/pydata/pandas/issues/8162
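Current pandas already approximates the "index as a selection of columns" view through set_index and reset_index. A small sketch with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "A", "B"],
    "year": [2020, 2021, 2020],
    "pop": [1, 2, 3],
})

# Zero key columns: today this still means a default RangeIndex.
no_key = df
# One key column becomes an ordinary Index.
one_key = df.set_index("city")
# Several key columns become a MultiIndex.
two_keys = df.set_index(["city", "year"])

# reset_index turns the index levels back into ordinary columns.
round_trip = two_keys.reset_index()
```

Under the proposal above, the three cases would differ only in which columns are selected as keys, rather than in the type of a separate index object.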
That said, it's not clear to me how a Series with an Index fits into this world, and this would be a bigger API change.
@chris-b1 Yes, in fact I almost included "indexes as just a special type of column" as part of this issue, but then decided to save it for another one. Since you brought it up (and it's related), we might as well discuss it here.
I also really like this idea, because it's a major stumbling block for both new (and experienced) users.
It does entail a major overhaul of the pandas data model, though, which raises a number of questions. In particular: do we still use an Index/MultiIndex for DataFrame.columns? If so, then it follows that column names should now include the names of index columns.
This raises a big issue with how we handle "messy" data (i.e., non-tidy data). Currently, pandas is a pretty capable tool for such datasets, especially with the ability to stack/unstack columns into hierarchies. But if columns is a MultiIndex with multiple levels, adding in index column names is going to make things a mess.
Consider this example adapted from the multi-index docs:
import numpy as np
import pandas as pd

# Two-level column labels: (first, second)
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo'],
          ['one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
# Rows are labeled by a single named index, 'letter'
df = pd.DataFrame(np.random.randn(3, 6), index=pd.Index(['A', 'B', 'C'], name='letter'), columns=index)
print(df)
first        bar                 baz                 foo
second       one       two       one       two       one       two
letter
A       1.131677  3.008499 -1.513677  0.379074 -0.546790 -2.221491
B      -1.650027  2.157229 -1.030519 -0.187412  0.711109 -0.334537
C       1.226648  0.631318  0.197816  0.494960 -0.435740  1.098061
Do we now need to add in an extra level to the multi-index for the index column name (e.g., letter)? Or do we disallow a MultiIndex for column names altogether and use an index of tuples instead?
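For reference, this is roughly what the two options look like with today's API, using a smaller frame than the example above. With its default arguments, reset_index pads the index name to the full column depth with an empty string, and MultiIndex.to_flat_index gives a plain Index of tuples.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame(
    np.zeros((2, 4)),
    index=pd.Index(["A", "B"], name="letter"),
    columns=columns,
)

# Option 1: the index name becomes a full-depth column key, padded with ''.
flat = df.reset_index()
padded_key = flat.columns[0]  # ('letter', '')

# Option 2: drop the MultiIndex in favor of a plain Index of tuples.
tuple_columns = df.columns.to_flat_index()
```

Neither result is pretty, which is the mess the question above is getting at.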
In general, I don't think it's worth a huge amount of effort to make it easy to work with such data, given how much nicer tidy data is and the existence of multi-dimensional alternatives with a cleaner data model in the form of xarray, but such a change would certainly break a non-negligible number of workflows. Making indexes optional would be a much less ambiguous win.
> That said, it's not clear to me how a Series with an Index fits into this world
I think it could work in roughly the same way it currently does. Pulling a column out of a DataFrame would return a Series object, associating with it any indexes on the frame.
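That is how column selection already behaves today, as this small sketch (with made-up labels) shows:

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [1, 2, 3]},
    index=pd.Index(["a", "b", "c"], name="key"),
)

# Selecting a column yields a Series that carries the frame's index along.
col = df["x"]
```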
I'll put some thought into this when I have a chance, but: one possibility to consider is exposing a more primitive pandas.Table to the user, as a "DataFrame without the Index".
- Referencing a column would give a pandas.Array.
- Combining an Array with an Index yields a Series.
- Combining an Index with a Table produces a DataFrame.
The Table would be more like an R data.frame / data frames.
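A rough sketch of how these pieces could compose. `Table`, `Array`, and `with_index` below are hypothetical stand-ins written for illustration and are not existing pandas classes or methods; the columns are stored as plain lists to keep the example short.

```python
from dataclasses import dataclass
from typing import Dict, List

import pandas as pd


@dataclass
class Array:
    """Hypothetical index-free column."""
    values: List

    def with_index(self, index: pd.Index) -> pd.Series:
        # Array + Index -> Series
        return pd.Series(self.values, index=index)


@dataclass
class Table:
    """Hypothetical 'DataFrame without the Index': named, equal-length columns."""
    columns: Dict[str, List]

    def __post_init__(self) -> None:
        lengths = {len(values) for values in self.columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")

    def __getitem__(self, name: str) -> Array:
        # Referencing a column gives an Array, not a Series.
        return Array(self.columns[name])

    def with_index(self, index: pd.Index) -> pd.DataFrame:
        # Table + Index -> DataFrame
        return pd.DataFrame(self.columns, index=index)


table = Table({"x": [1, 2], "y": [3, 4]})
labels = pd.Index(["a", "b"])
series = table["x"].with_index(labels)
frame = table.with_index(labels)
```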
Would the Table have the same functionality as a DataFrame? I.e., queries, joins, aggregations, IO, etc?
One thought would be to equip Table with the most essential relational algebra and manipulations (add/remove columns, etc.) but make everything deferred. The deferred table DSL I designed for Ibis is one example of such a language; it has effectively 1-1 parity with SQL and could provide some inspiration.
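To illustrate what "deferred" means here, a toy sketch: operations are recorded and only run when execute() is called. `DeferredTable` and its methods are hypothetical; this is not the Ibis API, and pandas is used as the execution backend purely for convenience.

```python
import pandas as pd


class DeferredTable:
    """Hypothetical lazy table: records operations, runs them on execute()."""

    def __init__(self, source: pd.DataFrame, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def filter(self, predicate):
        # Record a row filter; nothing is computed yet.
        step = lambda df: df[predicate(df)]
        return DeferredTable(self._source, self._steps + (step,))

    def add_column(self, name, func):
        # Record a derived column; nothing is computed yet.
        step = lambda df: df.assign(**{name: func(df)})
        return DeferredTable(self._source, self._steps + (step,))

    def execute(self) -> pd.DataFrame:
        result = self._source
        for step in self._steps:
            result = step(result)
        # Drop the labels so the result behaves like an index-free table.
        return result.reset_index(drop=True)


source = pd.DataFrame({"x": [1, 2, 3]})
expr = (
    DeferredTable(source)
    .filter(lambda df: df["x"] > 1)
    .add_column("y", lambda df: df["x"] * 2)
)
out = expr.execute()
```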
> One possibility to consider is exposing a more primitive pandas.Table to the user, as a "DataFrame without the Index".
I'm a really big fan of the Table data structure. It could do enough for most users and client libraries, and DataFrame could be left for those who need indexing and alignment (which are important but niche use cases).
See also the datascience package for teaching introductory data science in Python.
On the other hand, the downside is that now we have two similar core data structures for tabular data. With the dynamic nature of Python, this could easily lead to confusion, and also twice the API to maintain.
> One thought would be to equip Table with the most essential relational algebra and manipulations (add/remove columns, etc.) but make everything deferred
I'm all for deferred APIs, but I'm less sure that this makes sense for the DataFrame/Table distinction. It dilutes the message of Table as "DataFrame without the Index".
I definitely like the idea of a base Table and then a DataFrame that adds an index. Other than indices, there doesn't need to be a distinction in terms of functionality.
@shoyer I agree that it would be better to have a deferred API as a separate beast and to make the basic table a pared-down, indexless DataFrame (with all operations eagerly evaluated). Was just curious what you all thought =)
I am probably going to make indexes optional in the next version of xarray (pydata/xarray#1017). (Note that in xarray, an index already is basically just a special kind of column, but currently we always generate an index like range(n).) I guess we'll see how the transition goes, but I am tentatively very optimistic about it.
It occurs to me that an additional virtue of optional indexes is that they could allow us to further clean up DataFrame.__getitem__ with sane mixing between label and position based indexing, because we can differentiate between intentional integer indexes and no index at all. I'll elaborate over in #22.
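The ambiguity in question shows up whenever integer labels do not match positions. A small sketch with current pandas, using the explicit .loc and .iloc accessors:

```python
import pandas as pd

# An intentional integer index whose labels disagree with positions.
s = pd.Series([10, 20, 30], index=[2, 1, 0])
by_label = s.loc[0]      # the row labeled 0
by_position = s.iloc[0]  # the first row

# With the default RangeIndex, labels and positions happen to coincide,
# so pandas cannot tell whether the integers were meant as labels.
default = pd.Series([10, 20, 30])
```

If "no index" were representable, integer keys on an index-free object could unambiguously mean positions.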