daru icon indicating copy to clipboard operation
daru copied to clipboard

Should we support duplicated keys in an index?

Open yui-knk opened this issue 8 years ago • 16 comments

Daru::Vector.new([1,2,3], index: [1,1,1])
=> IndexError: Expected index size >= vector size. Index size : 1, vector size : 3

Daru::Index.new([1,1,1])
=> #<Daru::Index(1): {1}>

Index#initialize uses @relation_hash = index.each_with_index.to_h.freeze, this causes

[1,2,1].each_with_index.to_h
=> {1=>2, 2=>1}

yui-knk avatar Sep 12 '16 14:09 yui-knk

Yes that's right, but that will require restructuring a lot of internal code and data structures which would only make sense if users actually want duplicate entries into the index. This is open to discussion and we can decide on something once we get some feedback.

@gnilrets @mrkn @zverok

v0dro avatar Sep 17 '16 15:09 v0dro

To be honest, at this point I'm feeling like I started to lose entire index idea. My understanding was going through those stages:

  1. We have a 2d table with named columns and rows (The DataFrame), ok
  2. Natural row indexing would be just (0,1,2,3...)
  3. Rows could be named or labelled somehow, which is incredibly useful, because now you can say "give me row for Jan 11" or "give me row for Department B" naturally (same about columns, which are also index-ed)
  4. So far, so good. The core of this logic is that values in index are unique, so providing one value, you obtain one row/column -- that's the core idea of an index.
  5. OK, introducing MultiIndex concept and proper DateIndex complicates things a bit, but they are still graspable: you can provide partial value (only beginning of multi index key, only year of date index), and receive several rows. But that's explainable, and you still can receive exactly one row by fully-qualified index value.
  6. Then, it turned out that CategoricalIndex is allowed to have several equal values, right? That's where you started to losing me (in "I can't understand it" sense, not in "I am running away in panic" sense). I understand the feature was copied from pandas, but what is right way to think about it?.. What is the "physical sense" of having dataframe with index [a, a, b, a, c, a, b] and trying to get row(s) [a] from the dataframe? ¯_(ツ)_/¯
  7. So, what's the point of "index with duplicated values"? Considering 1-5, there is none. But considering (6)... Who am I to decide?..

PS: As a side note, in my head, behavior in example is right, but error text is wrong. It should be something like "You are trying to create index from array with duplications!" or something.

zverok avatar Sep 17 '16 16:09 zverok

@v0dro, @yui-knk I think if you need to index with a column that can take duplicate values then I think Daru::CategoricalIndex will best suit your need. In fact, this is one of the core uses of Categorical Index.

>> idx = Daru::CategoricalIndex.new [1, 2, 1]
=> #<Daru::CategoricalIndex(3): {1, 2, 1}>
>> dv = Daru::Vector.new [:a, :b, :c], index: idx
=> #<Daru::Vector(3)>
   1   a
   2   b
   1   c
>> dv[1]
=> #<Daru::Vector(2)>
   1   a
   1   c

@zverok

What is the "physical sense" of having dataframe with index [a, a, b, a, c, a, b] and trying to get row(s) [a] from the dataframe?

I think its helpful when say you want to identify the rows in the dataframe with a categorical property. More often than not we identify the entries with a unique property such as id. In such cases naive Index class is useful but there are other cases in which we might want to identify the rows with a qualitative property, in that case CategoricalIndex will be useful. Here's one such example. In this example, we index the dataframe by countries.

The way I see index is a way to identify different rows and I see no reason why we should restrict it to only identify unique rows. Although this is usually the case.

Also, I think the idea of index is to support fast retrieval which is also why I think CategoricalIndex is an asset.

lokeshh avatar Sep 17 '16 16:09 lokeshh

@lokeshh To be honest, for me your example still looks like just usual MultiIndex usage would be 1000 times more consistent (not needing an explanations for dummies like me, saying "ok, index is something consisting of unique values, unless that's a CategoryIndex"). Like, you have your source dataframe, you do "some operation", and you have it indexed with MultiIndex having first "column" (value of tuple) being category, and the second being an original id of this row.

To be honest twice, df.row['Africa'].size # => 46 is not the most predictable behavior :)

zverok avatar Sep 17 '16 17:09 zverok

(But I am not a data scientist, this entire "categories" thing for me is terra incognita at most)

zverok avatar Sep 17 '16 17:09 zverok

@zverok You have a point assuming the definition of index as something to uniquely identify each row. So, I think it all boils down to whether you think about index as something which uniquely identifies each row or not. I don't know how data scientists think about index but in database terminology, key or index can take non-unique values. Its the primary key or primary index which has to be unique. More about it here.

lokeshh avatar Sep 17 '16 17:09 lokeshh

So I understand an index as a unique identifier to the row. Isn't that how vectors in a data frame line up in to rows?

But what can you accomplish with a vector with non unique indexes that you can't accomplish with a data frame with two columns?

gnilrets avatar Sep 17 '16 18:09 gnilrets

@gnilrets Vector with non unique indexes guarantee fast retrieval which data frame with two columns does not.

lokeshh avatar Sep 17 '16 18:09 lokeshh

I'm not sure I understand how the concept "guarantees" it will be faster, particularly if we start overloading it with functionality we don't strictly require to manipulate data.

gnilrets avatar Sep 17 '16 18:09 gnilrets

I'm not sure I understand how the concept "guarantees" it will be faster,...

@gnilrets For instance see this use case. Here we often want to retrieve the rows with index :idx. Now if we index it with categorical index then the access will be much much faster. This is the example with only two columns used and with more columns the performance gain will be even more.

lokeshh avatar Sep 17 '16 18:09 lokeshh

What prevents us from refactoring #where to be as fast?

gnilrets avatar Sep 17 '16 18:09 gnilrets

@gnilrets Nothing.

I think in fact we should optimize #where irrelevant of whether we have CategoricalIndex or not.

So the question rises whether we should drop CategoricalIndex because other things can accomplish the same? Inspite of my surrender to that all things which are accomplishable via CategoricalIndex can still be accomplished though other things, I'm still of the opinion that CategoricalIndex is required to fills the gap for need to have non-unique indexing.

lokeshh avatar Sep 20 '16 23:09 lokeshh

I've never used pandas, so maybe I just don't get the utility of having categorical indices. But I think I've been able to handle selecting, joining, grouping, and aggregating data using a concept of a categorical variable that is functionally equivalent to a simple string variable.

gnilrets avatar Sep 20 '16 23:09 gnilrets

Since CategoricalIndex is an "index", it can keep looking up items by category values faster than using #where.

mrkn avatar Sep 25 '16 12:09 mrkn

To be honest, the more I think of this "categorical" things, the more I became confused. We borrowed it in total from pandas, I know, and it was a lot of great work, but...

For me, there are two things are mixing constantly: categorical vector as a concept (you can explain it to other programmers, not being into statistics, as "it's just a enum, nothing new!"), and, like, details of implementation and effectiveness, leaking into abstract concepts.

Even considering all the explanation, for me "special kind of index could have duplicating values" still feels like something deeply wrong.

zverok avatar Sep 25 '16 15:09 zverok

What prevents us from refactoring #where to be as fast?

I just read about index maintenance cost and want to share a new insight with you. Making the #where fast by default isn't a good idea because that would slow down the update operations like insert/deleting a row and many more because of index maintenance cost (every time a row is inserted the index has to be rebuild that is making the #where fast).

lokeshh avatar Nov 16 '16 04:11 lokeshh