daru
joining on index
I request some means to join two Daru::DataFrame objects on their index, like pandas does.
If I've understood properly, given this input:
2.3.1 :006 > l1 = Daru::DataFrame.new({ a: [1,2,3], b: [4,5,6]}, index: ['x','y','z'])
=> #<Daru::DataFrame(3x2)>
a b
x 1 4
y 2 5
z 3 6
2.3.1 :007 > l2 = Daru::DataFrame.new({ c: [7,8,9], d: [10,11,12]}, index: ['x','y','z'])
=> #<Daru::DataFrame(3x2)>
c d
x 7 10
y 8 11
z 9 12
Then, desired output is something like
=> #<Daru::DataFrame(3x4)>
a b c d
x 1 4 7 10
y 2 5 8 11
z 3 6 9 12
Right? I'd like to work on this. :smile:
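To make the desired behaviour concrete, here is a minimal plain-Ruby sketch of joining on index. It does not use Daru at all: `Frame` and `join_on_index` are hypothetical names invented for illustration, and the sketch assumes left-join semantics (the left index is kept, and labels missing on the right yield nil).

```ruby
# Plain-Ruby model of a data frame: an index array plus a hash of columns.
# This is NOT Daru's API; Frame and join_on_index are hypothetical names.
Frame = Struct.new(:index, :columns)

# Combine the columns of two frames, aligning right's rows to left's index.
def join_on_index(left, right)
  # Map each right-hand index label to its row position.
  position = right.index.each_with_index.to_h
  aligned = right.columns.transform_values do |values|
    left.index.map { |label| position.key?(label) ? values[position[label]] : nil }
  end
  Frame.new(left.index, left.columns.merge(aligned))
end

l1 = Frame.new(%w[x y z], { a: [1, 2, 3], b: [4, 5, 6] })
l2 = Frame.new(%w[x y z], { c: [7, 8, 9], d: [10, 11, 12] })

joined = join_on_index(l1, l2)
joined.index.each_with_index do |label, i|
  puts [label, *joined.columns.values.map { |col| col[i] }].join(' ')
end
```

Because rows are looked up by label rather than by position, the result stays the same if the right-hand frame lists its index in a different order.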
I had a look at the core/merge.rb file, but I can't work out how to try the join function on the console (irb). Any help would be appreciated.
You can start a session with rake pry and then call join on your DataFrame like df.join(....).
If no one is working on this, I would like to try it.
@Shekharrajak yes you can.
I think join should work on the index by default, right?
@gnilrets WDYT?
Makes sense to me...
It took me some time to understand core/merge.rb, since the file contains no examples or docs.
- I see that merge is defined in dataframe.rb with no option. But, as we know, merge plus some condition gives the different types of join.
- merge should be defined in core/merge.rb, and the same kind of option (as in join) should be accepted by merge as well.
- One special case is when some column(s) are common to df1 and df2: a suffix must then be added. (In that case we first add the suffix to the common columns of both data frames, then pass them into join.)
E.g.
irb(main):065:0> left
=> #<Daru::DataFrame(3x2)>
A B
0 A0 B0
1 A1 B1
2 A2 B2
irb(main):066:0> right
=> #<Daru::DataFrame(3x2)>
A c
K0 A0 D0
K2 A1 D2
K3 A3 D3
irb(main):064:0> left.merge(right)
=> #<Daru::DataFrame(3x4)>
A_1 B A_2 c
0 A0 B0 A0 D0
1 A1 B1 A1 D2
2 A2 B2 A3 D3
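The suffixing step described in the list above could be sketched as follows. This is plain Ruby operating on column names only; `suffix_common_columns` and its `suffixes:` keyword are hypothetical and not part of Daru. The default suffixes `_1` and `_2` are taken from the output shown above.

```ruby
# Hypothetical helper, not part of Daru: rename the column names that two
# frames share so that both copies survive a column-wise merge.
def suffix_common_columns(left_names, right_names, suffixes: %w[_1 _2])
  common = left_names & right_names
  rename = lambda do |names, suffix|
    names.map { |name| common.include?(name) ? :"#{name}#{suffix}" : name }
  end
  [rename.call(left_names, suffixes[0]), rename.call(right_names, suffixes[1])]
end

left_names, right_names = suffix_common_columns(%i[A B], %i[A c])
p left_names   # [:A_1, :B]
p right_names  # [:A_2, :c]
```

Columns that appear in only one frame keep their original names, so the renaming is a no-op when nothing overlaps.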
Problem in merge:
- It doesn't use the index to merge on.
See this example:
irb(main):002:0* left = Daru::DataFrame.new({
irb(main):003:2* :A => ['A0', 'A1', 'A2'],
irb(main):004:2* :B => ['B0', 'B1', 'B2']},
irb(main):005:1* index: ['K0', 'K1', 'K2'])
=> #<Daru::DataFrame(3x2)>
A B
K0 A0 B0
K1 A1 B1
K2 A2 B2
irb(main):006:0>
irb(main):007:0* right = Daru::DataFrame.new({
irb(main):008:2* :C => ['C0', 'C2', 'C3'],
irb(main):009:2* :D => ['D0', 'D2', 'D3']},
irb(main):010:1* index: ['K0', 'K2', 'K3'])
=> #<Daru::DataFrame(3x2)>
C D
K0 C0 D0
K2 C2 D2
K3 C3 D3
# The index values are different, but the frames are still merged.
irb(main):011:0> left.merge(right)
=> #<Daru::DataFrame(3x4)>
A B C D
0 A0 B0 C0 D0
1 A1 B1 C2 D2
2 A2 B2 C3 D3
# When no index is passed, it's fine.
irb(main):012:0> left = Daru::DataFrame.new({
irb(main):013:2* :A => ['A0', 'A1', 'A2'],
irb(main):014:2* :B => ['B0', 'B1', 'B2']},
irb(main):015:1* )
=> #<Daru::DataFrame(3x2)>
A B
0 A0 B0
1 A1 B1
2 A2 B2
irb(main):016:0> right = Daru::DataFrame.new({
irb(main):017:2* :C => ['C0', 'C2', 'C3'],
irb(main):018:2* :D => ['D0', 'D2', 'D3']},
irb(main):019:1* )
=> #<Daru::DataFrame(3x2)>
C D
0 C0 D0
1 C2 D2
2 C3 D3
irb(main):020:0> left.merge(right)
=> #<Daru::DataFrame(3x4)>
A B C D
0 A0 B0 C0 D0
1 A1 B1 C2 D2
2 A2 B2 C3 D3
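For contrast with the positional result above, this small plain-Ruby sketch shows which index labels an index-aware join would be expected to keep for the K0..K3 example. `joined_index` is a hypothetical helper, not Daru's API, and the join-type names are the conventional SQL-style ones.

```ruby
# Index labels from the example above.
left_index  = %w[K0 K1 K2]
right_index = %w[K0 K2 K3]

# Hypothetical helper (not Daru's API): which index labels survive each join type.
def joined_index(left, right, how)
  case how
  when :inner then left & right   # labels present in both frames
  when :outer then left | right   # labels present in either frame
  when :left  then left
  when :right then right
  else raise ArgumentError, "unknown join type: #{how.inspect}"
  end
end

%i[inner outer left right].each do |how|
  puts "#{how}: #{joined_index(left_index, right_index, how).join(' ')}"
end
```

Under these assumptions an inner join on index would keep only K0 and K2, whereas the merge output above pairs K1 with K2 and K2 with K3 purely by row position.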
References:
- http://stackoverflow.com/questions/38549/what-is-the-difference-between-inner-join-and-outer-join
- http://www.shanelynn.ie/merge-join-dataframes-python-pandas-index-1/#mergetypes
If I understand you correctly, you are basically proposing that parts of join should be included in merge, right? Why is that? You can very well do the same thing by specifying options to join.
The purpose of this issue is to come up with suitable APIs to join two dataframes on index.
Also, how is your first example (in the section below Problem with merge) any different from the first? We all know that joining on index is not yet supported. Your example demonstrates that perfectly.
And why do you want to perform a merge in the first place when the problem is with join?
Actually, my comment above is not closely related to this issue. But I think that once 'joining on index' is fixed, merge can be done as an inner join (the default join for merge) with on: index as the default. So we don't need to solve a separate merge issue.
Why should the default be changed from :inner to :index? Is there any compelling reason?
"So we don't need to solve separate merge issue."
What exactly do you mean by this?