pandas2

Optional indexes

Open shoyer opened this issue 8 years ago • 9 comments

The pandas.Index is fantastically useful, but in many cases pandas's insistence on always having an index gets in the way.

Usually, it can be safely ignored when not relevant, especially now that we have RangeIndex (which makes the cost of creating the index minimal), but this is not always the case:

  1. The indexing and join behavior of default RangeIndex is actively harmful. It would be better to raise an error when implicitly joining on an index between two datasets with a default index.
  2. When converting a DataFrame into other formats, we need an argument (e.g., index=True) for controlling whether or not to include the index.
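A minimal sketch of both points, using only current pandas behavior. The frame and column names are made up for illustration; the first half shows how label alignment on a default RangeIndex can silently misalign rows, and the second shows the `index` argument that export methods need today.

```python
import pandas as pd

left = pd.DataFrame({"x": [1, 2, 3, 4]})
right = pd.DataFrame({"y": [10, 20, 30, 40]})

# Filtering keeps the original RangeIndex labels (here 2 and 3).
subset = left[left["x"] > 2]

# Arithmetic aligns on labels, not on position. Labels 0 and 1 exist only
# on the right-hand side, so they become NaN instead of raising an error.
total = subset["x"] + right["y"]
print(total)

# Export methods need an explicit flag to leave the default index out.
csv_with_index = left.to_csv()
csv_without_index = left.to_csv(index=False)
print(csv_with_index.splitlines()[0])     # header row includes an empty index column
print(csv_without_index.splitlines()[0])  # header row is just "x"
```

Nothing here fails loudly, which is the problem: the two-row subset and the four-row column are combined without any warning that the default labels carried no meaning.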

I propose that we make the index optional, e.g., by allowing it to be set to None. This entails a need for some rules to handle missing indexes:

  • Operations that explicitly rely on indexes (e.g., .loc and join) should raise TypeError when called on objects without an index.
  • Operations that implicitly rely on indexes for alignment (e.g., the DataFrame constructor and arithmetic) now need to handle three cases:
    1. Index/index operations: These work as before. The result's index is the outer join of the input indexes.
    2. No-index/no-index operations: The inputs must have exactly the same length (otherwise raise TypeError). The result has no index.
    3. Mixed index/no-index operations: The inputs must have the same length. The result takes on the index from the input with an index.
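The three cases can be sketched roughly as follows. `Column` and `add` are hypothetical names invented for this illustration (pandas has no index-free Series today), and the sketch assumes that any index present has the same length as the values.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np
import pandas as pd


@dataclass
class Column:
    """Hypothetical stand-in for a Series whose index may be None."""
    values: np.ndarray
    index: Optional[pd.Index] = None


def add(left: Column, right: Column) -> Column:
    if left.index is not None and right.index is not None:
        # Case 1: index/index. Defer to pandas' existing label alignment,
        # which produces the outer join of the two indexes.
        result = (pd.Series(left.values, index=left.index)
                  + pd.Series(right.values, index=right.index))
        return Column(result.to_numpy(), result.index)

    # Cases 2 and 3: at least one side has no index, so lengths must match.
    if len(left.values) != len(right.values):
        raise TypeError("operands without a shared index must have the same length")

    # Case 2 yields no index; case 3 inherits the one index that exists.
    index = left.index if left.index is not None else right.index
    return Column(left.values + right.values, index)


plain = Column(np.array([1, 2, 3]))
labelled = Column(np.array([10, 20, 30]), pd.Index(["a", "b", "c"]))
other = Column(np.array([1, 2]), pd.Index(["b", "d"]))

no_index = add(plain, plain)     # case 2
mixed = add(plain, labelled)     # case 3
aligned = add(labelled, other)   # case 1
```

The point of the sketch is that positional operations never invent labels, and label-based operations keep today's behavior.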

Somewhat related: #15

shoyer avatar Sep 07 '16 16:09 shoyer

+1, although I think the opposite approach may also be worth consideration.

What if, instead of being a special property of a DataFrame, an "Index" were just defined by a selection of columns in the frame:

  • 0 (your None)
  • 1 (today's Index)
  • or more (something like a MultiIndex)

This would remove the Index / column distinction, which I think is a stumbling block for many. Some discussion here: https://github.com/pydata/pandas/issues/8162
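For reference, current pandas already lets you move between these states with `set_index` and `reset_index`, which is roughly what "an index is a selection of columns" looks like today. The example data is made up.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Rome"],
    "year": [2015, 2016, 2016],
    "pop": [0.65, 0.66, 2.87],
})

one = df.set_index("city")             # one index column: a plain Index
two = df.set_index(["city", "year"])   # two index columns: a MultiIndex

# reset_index() turns the index levels back into ordinary columns.
roundtrip = two.reset_index()
print(list(one.columns))
print(list(roundtrip.columns))
```

The stumbling block is that an indexed column disappears from `df.columns` and has to be accessed through a different API until it is reset.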

That said, it's not clear to me how a Series with an Index fits into this world, and this would be a bigger API change.

chris-b1 avatar Sep 07 '16 17:09 chris-b1

@chris-b1 Yes, in fact I almost included "indexes as just a special type of column" as part of this issue, but then decided to save it for another one. Since you brought it up (and it's related), we might as well discuss it here.

I also really like this idea, because the index/column distinction is a major stumbling block for both new and experienced users.

It does entail a major overhaul of the pandas data model, though, which raises a number of questions. In particular: do we still use an Index/MultiIndex for DataFrame.columns? If so, then it follows that column names should now include the names of index columns.

This raises a big issue with how we handle "messy" data (i.e., non-tidy data). Currently, pandas is a pretty capable tool for such datasets, especially with the ability to stack/unstack columns into hierarchies. But if columns is a MultiIndex with multiple levels, adding in index column names is going to make things a mess.

Consider this example adapted from the multi-index docs:

import numpy as np
import pandas as pd

# Two-level column labels: (first, second)
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo'],
          ['one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
# Rows are labelled by a named index, 'letter'
df = pd.DataFrame(np.random.randn(3, 6), index=pd.Index(['A', 'B', 'C'], name='letter'), columns=index)
print(df)
first        bar                 baz                 foo          
second       one       two       one       two       one       two
letter                                                            
A       1.131677  3.008499 -1.513677  0.379074 -0.546790 -2.221491
B      -1.650027  2.157229 -1.030519 -0.187412  0.711109 -0.334537
C       1.226648  0.631318  0.197816  0.494960 -0.435740  1.098061

Do we now need to add in an extra level to the multi-index for the index column name (e.g., letter)? Or do we disallow a MultiIndex for column names altogether and use an index of tuples instead?
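Both options can be previewed with existing pandas calls. This is only a sketch of what the two shapes look like today, not a proposal for how the new data model would behave; the frame is rebuilt here so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["bar", "baz", "foo"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame(
    np.random.randn(3, 6),
    index=pd.Index(["A", "B", "C"], name="letter"),
    columns=columns,
)

# Option 1: the index becomes a column inside the existing MultiIndex.
# pandas pads the missing second level with an empty string.
as_column = df.reset_index()
print(as_column.columns[0])

# Option 2: give up the MultiIndex and label columns with plain tuples.
flat = df.columns.to_flat_index()
print(flat[0])
```

In the first case the padded `('letter', '')` label shows the awkwardness: the index name has no natural value for the lower column levels.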

In general, I don't think it's worth a huge amount of effort to make it easy to work with such data, given how much nicer tidy data is and the existence of multi-dimensional alternatives with a cleaner data model in the form of xarray, but such a change would certainly break a non-negligible number of workflows. Making indexes optional would be a much less ambiguous win.

> That said, it's not clear to me how a Series with an Index fits into this world

I think it could work in roughly the same way it currently does. Pulling a column out of a DataFrame would return a Series object, carrying along any indexes on the frame.

shoyer avatar Sep 07 '16 17:09 shoyer

I'll put some thought into this when I have a chance, but: one possibility to consider is exposing a more primitive pandas.Table to the user, as a "DataFrame without the Index".

  • Referencing a column would give a pandas.Array.
  • Combining an Array with an Index would give a Series.
  • Combining a Table with an Index would give a DataFrame.
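A rough sketch of that composition using today's objects. `pandas.Table` and `pandas.Array` do not exist; here a plain dict of NumPy arrays stands in for the Table and a NumPy array for the Array, purely as an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical "Table": just a dict of equal-length arrays, no index.
table = {"x": np.array([1.0, 2.0, 3.0]), "y": np.array([4.0, 5.0, 6.0])}
index = pd.Index(["a", "b", "c"], name="key")

array = table["x"]                          # Table column -> Array
series = pd.Series(array, index=index)      # Array + Index -> Series
frame = pd.DataFrame(table, index=index)    # Table + Index -> DataFrame

print(series.loc["b"])
print(frame.loc["c", "y"])
```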

The Table would be more like an R data.frame / data frames.

wesm avatar Sep 07 '16 18:09 wesm

Would the Table have the same functionality as a DataFrame? I.e., queries, joins, aggregations, IO, etc?

chrisaycock avatar Sep 07 '16 19:09 chrisaycock

One thought would be to equip Table with the most essential relational algebra and manipulations (add/remove columns, etc.) but make everything deferred. The deferred table DSL I designed for Ibis is one example of such a language; it has effectively 1-1 parity with SQL and could provide some inspiration.
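To illustrate what "deferred" means here, a toy sketch in which operations are only recorded until `execute()` is called. `DeferredTable` and its methods are hypothetical names made up for this example; this is not the Ibis API, and a real design would build an expression tree rather than a list of closures.

```python
import pandas as pd


class DeferredTable:
    """Hypothetical: records operations and runs them only on execute()."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def _then(self, step):
        # Each operation returns a new expression; nothing is computed yet.
        return DeferredTable(self._source, self._steps + (step,))

    def filter(self, predicate):
        return self._then(lambda df: df[predicate(df)])

    def add_column(self, name, func):
        return self._then(lambda df: df.assign(**{name: func(df)}))

    def execute(self):
        result = self._source
        for step in self._steps:
            result = step(result)
        # The table is meant to be indexless, so drop any leftover labels.
        return result.reset_index(drop=True)


data = pd.DataFrame({"x": [1, 2, 3]})
expr = (DeferredTable(data)
        .filter(lambda df: df["x"] > 1)
        .add_column("double", lambda df: df["x"] * 2))
out = expr.execute()
print(out)
```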

wesm avatar Sep 07 '16 20:09 wesm

> One possibility to consider is exposing a more primitive pandas.Table to the user, as a "DataFrame without the Index".

I'm a really big fan of the Table data structure. It could do enough for most users and client libraries, and DataFrame could be left for those who need indexing and alignment (which are important but niche use cases).

See also the datascience package for teaching introductory data science in Python.

On the other hand, the downside is that we would now have two similar core data structures for tabular data. Given the dynamic nature of Python, this could easily lead to confusion, and it also means twice the API to maintain.

> One thought would be to equip Table with the most essential relational algebra and manipulations (add/remove columns, etc.) but make everything deferred

I'm all for deferred APIs, but I'm less sure that this makes sense for the DataFrame/Table distinction. It dilutes the message of Table as "DataFrame without the Index".

shoyer avatar Sep 08 '16 01:09 shoyer

I definitely like the idea of a base Table and then a DataFrame that adds an index. Other than indices, there doesn't need to be a distinction in terms of functionality.

chrisaycock avatar Sep 08 '16 02:09 chrisaycock

@shoyer I agree that a deferred API would be better as a separate beast, with the basic table being a pared-down, indexless DataFrame (with all operations eagerly evaluated). I was just curious what you all thought =)

wesm avatar Sep 08 '16 03:09 wesm

I am probably going to make indexes optional in the next version of xarray (pydata/xarray#1017). (Note that in xarray, an index already is basically just a special kind of column, but currently we always generate an index like range(n).) I guess we'll see how the transition goes, but I am tentatively very optimistic about it.

It occurs to me that an additional virtue of optional indexes is that it could allow us to further clean up DataFrame.__getitem__ with sane mixing between label and position based indexing, because we can differentiate between intentional integer indexes and no index at all. I'll elaborate over in #22.
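A small sketch of the ambiguity in question, using current pandas. With an intentional integer index, label-based and position-based lookups of `0` disagree; with a default index they coincide, so the integer labels carry no extra information. The example values are made up.

```python
import pandas as pd

# Intentional integer labels that do not match positions.
labelled = pd.Series(["a", "b", "c"], index=[2, 1, 0])
# Default index: the labels are just positions.
default = pd.Series(["a", "b", "c"])

print(labelled.loc[0])    # label 0 is the last element
print(labelled.iloc[0])   # position 0 is the first element
print(default.loc[0] == default.iloc[0])
```

If "no index" were a distinct state, plain integer lookups on such an object could be positional without any guesswork.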

shoyer avatar Sep 26 '16 05:09 shoyer