joining on index

Open genya0407 opened this issue 8 years ago • 12 comments

I would like a way to join two Daru::DataFrame objects on their index, similar to the index join in pandas.

genya0407 avatar Aug 26 '16 16:08 genya0407

If I've understood properly,

Input

2.3.1 :006 > l1 = Daru::DataFrame.new({ a: [1,2,3], b: [4,5,6]}, index: ['x','y','z'])
 => #<Daru::DataFrame(3x2)>
       a   b
   x   1   4
   y   2   5
   z   3   6 
2.3.1 :007 > l2 = Daru::DataFrame.new({ c: [7,8,9], d: [10,11,12]}, index: ['x','y','z'])
 => #<Daru::DataFrame(3x2)>
       c   d
   x   7  10
   y   8  11
   z   9  12 

Then the desired output is something like

 => #<Daru::DataFrame(3x4)>
       a   b   c   d
   x   1   4   7   10
   y   2   5   8   11
   z   3   6   9   12 

Right? I'd like to work on this. :smile:
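
To make the expected semantics concrete, here is a minimal plain-Ruby sketch of a label-based join. It does not use daru: each frame is modelled as a Hash of index label to row, and `join_rows_by_label` is a hypothetical helper name, not an existing daru method. It also assumes every label in the left frame exists in the right frame. The right frame is deliberately stored in a different row order to show that rows are matched by label, not by position.

```ruby
# Hypothetical helper, not part of daru. A "frame" here is a Hash of
# index label => Hash of column name => value.
def join_rows_by_label(left, right)
  left.each_with_object({}) do |(label, row), joined|
    # Look the right-hand row up by label, so its position does not matter.
    # Hash#fetch raises KeyError if the label is missing on the right.
    joined[label] = row.merge(right.fetch(label))
  end
end

l1 = { 'x' => { a: 1, b: 4 }, 'y' => { a: 2, b: 5 }, 'z' => { a: 3, b: 6 } }
# Same labels as l1, stored in a different order.
l2 = { 'x' => { c: 7, d: 10 }, 'z' => { c: 9, d: 12 }, 'y' => { c: 8, d: 11 } }

joined = join_rows_by_label(l1, l2)
joined.each { |label, row| puts "#{label}  #{row.values.join('  ')}" }
```

The result keeps the left frame's label order and matches the 3x4 table above.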

athityakumar avatar Feb 06 '17 12:02 athityakumar

I had a look at core/merge.rb, but I can't work out how to try the join function from the console (irb). Any help would be appreciated.

athityakumar avatar Feb 06 '17 12:02 athityakumar

You can start a session with rake pry and then call join on your DataFrame like df.join(....).

v0dro avatar Feb 07 '17 10:02 v0dro

If no one is working on this, I'd be interested in taking it on.

Shekharrajak avatar Mar 02 '17 05:03 Shekharrajak

@Shekharrajak yes you can.

v0dro avatar Mar 04 '17 16:03 v0dro

I think join should work on the index by default, right?

Shekharrajak avatar Mar 17 '17 15:03 Shekharrajak

@gnilrets WDYT?

v0dro avatar Mar 19 '17 13:03 v0dro

Makes sense to me...

gnilrets avatar Mar 20 '17 14:03 gnilrets

It took me a while to understand core/merge.rb, since the file has no examples or docs.

  1. I see that merge is defined in dataframe.rb and takes no options. But as we know, a merge plus some condition is what the different types of join are.

  2. merge should be defined in core/merge.rb, and it should accept the same kind of options as join.

  3. One special case is when df1 and df2 have one or more columns in common: then a suffix must be added. (In that case we first add the suffix to the common columns of both dataframes and then pass them into join.)

E.g.

irb(main):065:0> left
=> #<Daru::DataFrame(3x2)>
       A   B
   0  A0  B0
   1  A1  B1
   2  A2  B2
irb(main):066:0> right
=> #<Daru::DataFrame(3x2)>
       A   c
  K0  A0  D0
  K2  A1  D2
  K3  A3  D3

irb(main):064:0> left.merge(right)
=> #<Daru::DataFrame(3x4)>
     A_1   B A_2   c
   0  A0  B0  A0  D0
   1  A1  B1  A1  D2
   2  A2  B2  A3  D3
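
The suffixing rule from point 3 can be sketched in plain Ruby on column names alone. This is an illustration, not daru's implementation: `suffix_common_columns` and its `suffixes:` parameter are hypothetical names, and the `_1` / `_2` defaults are simply taken from the merge output above.

```ruby
# Hypothetical helper, not daru's implementation: rename only the columns
# that appear in both frames, leaving the others untouched.
def suffix_common_columns(left_cols, right_cols, suffixes: %w[_1 _2])
  common = left_cols & right_cols
  rename = lambda do |cols, suffix|
    cols.map { |col| common.include?(col) ? :"#{col}#{suffix}" : col }
  end
  [rename.call(left_cols, suffixes[0]), rename.call(right_cols, suffixes[1])]
end

left_cols, right_cols = suffix_common_columns(%i[A B], %i[A c])
p left_cols   # [:A_1, :B]
p right_cols  # [:A_2, :c]
```

Renaming before the join keeps the join itself free of any name-collision handling.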


Problem with merge:

  • It doesn't use the index to merge on.

See this example:

irb(main):002:0* left = Daru::DataFrame.new({
irb(main):003:2* :A => ['A0', 'A1', 'A2'],
irb(main):004:2* :B => ['B0', 'B1', 'B2']},
irb(main):005:1* index: ['K0', 'K1', 'K2'])
=> #<Daru::DataFrame(3x2)>
       A   B
  K0  A0  B0
  K1  A1  B1
  K2  A2  B2
irb(main):006:0>
irb(main):007:0* right = Daru::DataFrame.new({
irb(main):008:2* :C => ['C0', 'C2', 'C3'],
irb(main):009:2* :D => ['D0', 'D2', 'D3']},
irb(main):010:1* index: ['K0', 'K2', 'K3'])
=> #<Daru::DataFrame(3x2)>
       C   D
  K0  C0  D0
  K2  C2  D2
  K3  C3  D3

# The index values are different, but the frames are still merged

irb(main):011:0> left.merge(right)
=> #<Daru::DataFrame(3x4)>
       A   B   C   D
   0  A0  B0  C0  D0
   1  A1  B1  C2  D2
   2  A2  B2  C3  D3

# When no index is passed, it's fine.

irb(main):012:0> left = Daru::DataFrame.new({
irb(main):013:2* :A => ['A0', 'A1', 'A2'],
irb(main):014:2* :B => ['B0', 'B1', 'B2']},
irb(main):015:1* )
=> #<Daru::DataFrame(3x2)>
       A   B
   0  A0  B0
   1  A1  B1
   2  A2  B2
irb(main):016:0> right = Daru::DataFrame.new({
irb(main):017:2* :C => ['C0', 'C2', 'C3'],
irb(main):018:2* :D => ['D0', 'D2', 'D3']},
irb(main):019:1* )
=> #<Daru::DataFrame(3x2)>
       C   D
   0  C0  D0
   1  C2  D2
   2  C3  D3
irb(main):020:0> left.merge(right)
=> #<Daru::DataFrame(3x4)>
       A   B   C   D
   0  A0  B0  C0  D0
   1  A1  B1  C2  D2
   2  A2  B2  C3  D3
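
For comparison, here is a rough plain-Ruby sketch of what an index-aware join could return for the K0/K1/K2 and K0/K2/K3 frames above. It does not use daru, and `join_on_index` with its `how:` keyword is a hypothetical API made up for this illustration; frames are modelled as a Hash of index label to row.

```ruby
# Hypothetical sketch, not daru's API. A "frame" is a Hash of
# index label => Hash of column name => value.
def join_on_index(left, right, how: :inner)
  labels =
    case how
    when :inner then left.keys & right.keys # labels present in both frames
    when :outer then left.keys | right.keys # labels present in either frame
    else raise ArgumentError, "unsupported join type: #{how}"
    end

  columns = (left.values + right.values).flat_map(&:keys).uniq

  labels.each_with_object({}) do |label, joined|
    row = left.fetch(label, {}).merge(right.fetch(label, {}))
    # Cells with no source row are filled with nil.
    joined[label] = columns.map { |col| [col, row[col]] }.to_h
  end
end

left = {
  'K0' => { A: 'A0', B: 'B0' },
  'K1' => { A: 'A1', B: 'B1' },
  'K2' => { A: 'A2', B: 'B2' }
}
right = {
  'K0' => { C: 'C0', D: 'D0' },
  'K2' => { C: 'C2', D: 'D2' },
  'K3' => { C: 'C3', D: 'D3' }
}

inner = join_on_index(left, right, how: :inner)
outer = join_on_index(left, right, how: :outer)
p inner.keys # ["K0", "K2"]
p outer.keys # ["K0", "K1", "K2", "K3"]
```

With an inner join only K0 and K2 survive, so row K2 pairs A2/B2 with C2/D2 rather than with C3/D3 as in the positional merge.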

References:

  1. http://stackoverflow.com/questions/38549/what-is-the-difference-between-inner-join-and-outer-join

  2. http://www.shanelynn.ie/merge-join-dataframes-python-pandas-index-1/#mergetypes

Shekharrajak avatar Mar 25 '17 09:03 Shekharrajak

If I understand you correctly, you are basically proposing that parts of join should be included in merge, right? Why is that? You can very well do the same thing by specifying options to join.

The purpose of this issue is to come up with suitable APIs to join two dataframes on index.

Also, how is your first example (in the section below Problem with merge) any different from the first? We all know that joining on index is not yet supported. Your example is perfectly demonstrating that.

And why do you want to perform a merge in the first place when the problem is with join?

v0dro avatar Mar 26 '17 11:03 v0dro

Actually, my comment above is not closely related to this issue, but I think that once 'joining on index' is fixed, merge can be done using an inner join (the default join for merge) with on: index as the default. So we don't need to solve a separate merge issue.

Shekharrajak avatar Mar 26 '17 13:03 Shekharrajak

Why should the default be changed from :inner to :index? Is there any compelling reason?

So we don't need to solve a separate merge issue.

What exactly do you mean by this?

v0dro avatar Mar 28 '17 17:03 v0dro