
Implement sorted merge join

Open • raronson opened this issue 12 years ago • 1 comment

If both data sets are stored sorted on the join key, then it's possible to perform the join on the map side. The general idea is to:

  • Build up an index of keys to file location/offset of one of the data sets.
  • Use the other data set as normal input to a map job.
  • For each key, look up the corresponding file/offset from the index.
  • Directly read the file, seeking to the offset.
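
The steps above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Scoobi, Pig, or Hive code: the tab-separated `key\tvalue` file layout and the names `build_index`, `lookup`, `map_side_join`, and `demo` are all assumptions made for the example, and a local temp file stands in for a file on HDFS.

```python
import os
import tempfile


def build_index(path):
    """Map each join key to the byte offset of its first record in a sorted file."""
    index = {}
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            key = line.split(b"\t", 1)[0]
            # Keep only the first offset seen for each key.
            index.setdefault(key, offset)
    return index


def lookup(f, index, key):
    """Seek to the key's offset and yield the values of consecutive matching records."""
    offset = index.get(key)
    if offset is None:
        return
    f.seek(offset)
    for line in iter(f.readline, b""):
        k, v = line.rstrip(b"\n").split(b"\t", 1)
        if k != key:
            break  # the file is sorted, so the run of matching keys has ended
        yield v


def map_side_join(left_records, right_path):
    """Join (key, value) records against a sorted file without a shuffle or reduce step."""
    index = build_index(right_path)
    with open(right_path, "rb") as f:
        for key, left_value in left_records:
            for right_value in lookup(f, index, key):
                yield key, left_value, right_value


def demo():
    """Hypothetical data: a sorted 'right' file and a small 'left' map input."""
    with tempfile.NamedTemporaryFile("wb", suffix=".tsv", delete=False) as tmp:
        tmp.write(b"a\t1\nb\t2\nb\t3\nd\t4\n")
        path = tmp.name
    try:
        left = [(b"a", b"x"), (b"b", b"y"), (b"c", b"z")]
        return list(map_side_join(left, path))
    finally:
        os.remove(path)
```

In this sketch the index holds every distinct key, which is the simplest form of the idea; for large inputs a sparser index (for example, one entry per block) would probably be needed. The lookup is correct even if the map input is unsorted, but when it is sorted on the same key the seeks only move forward through the file.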

Both Pig and Hive already have implementations, and this would be a nice addition to Scoobi.

Pig's implementation - http://wiki.apache.org/pig/PigMergeJoin
Hive's implementation - https://issues.apache.org/jira/browse/HIVE-1194

raronson, Feb 22 '13 01:02

I need code to implement a sort-merge join. Any suggestions?

kdarshit999, Mar 03 '14 16:03