scoobi
Implement sorted merge join
If both data sets are stored sorted on the join key, then it's possible to perform the join on the map side. The general idea is to:
- Build up an index of keys to file location/offset of one of the data sets.
- Use the other data set as normal input to a map job.
- For each key, look up the corresponding file/offset from the index.
- Directly read the file, seeking to the offset.
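The steps above can be sketched roughly as follows. This is only an illustration, not Scoobi, Pig, or Hive code: it is written in Python for brevity, it assumes tab-separated `key\tvalue` records, it uses an in-memory `io.BytesIO` in place of a file on HDFS, and the names `build_index`, `read_matches`, and `map_side_join` are hypothetical.

```python
import io

def build_index(f):
    """Map each join key to the byte offset of its first record in f."""
    index = {}
    f.seek(0)
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        key = line.split(b"\t", 1)[0]
        # Keep only the first offset seen for each key.
        index.setdefault(key, offset)
    return index

def read_matches(f, offset, key):
    """Seek to offset and yield the values of consecutive records with this key."""
    f.seek(offset)
    for line in iter(f.readline, b""):
        k, v = line.rstrip(b"\n").split(b"\t", 1)
        if k != key:
            break  # input is sorted, so there are no further matches
        yield v

def map_side_join(left_records, right_file, index):
    """Inner join: for each left record, look up and read the matching right records."""
    for key, left_value in left_records:
        offset = index.get(key)
        if offset is None:
            continue  # key is absent from the indexed data set
        for right_value in read_matches(right_file, offset, key):
            yield key, left_value, right_value

# Indexed data set, sorted on the join key (assumed format: key<TAB>value).
right = io.BytesIO(b"a\t1\nb\t2\nb\t3\nd\t4\n")
index = build_index(right)

# The other data set, as it would arrive at a mapper.
left = [(b"a", b"x"), (b"b", b"y"), (b"c", b"z")]
joined = list(map_side_join(left, right, index))
```

Because the mapper input is also sorted, the seeks move forward through the indexed file. An index with one entry per key, as here, may be too large for real data; a sparser index combined with a short forward scan is one possible alternative.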
There are already implementations in both Pig and Hive, and this would be a nice addition to Scoobi.
Pig's implementation: http://wiki.apache.org/pig/PigMergeJoin
Hive's implementation: https://issues.apache.org/jira/browse/HIVE-1194
I need code to implement a sort-merge join. Any suggestions?