Support for Hadoop processing

Open larsga opened this issue 11 years ago • 9 comments

From [email protected] on September 04, 2011 15:25:27

Longer-term we should be able to farm out processing work to Hadoop clusters.

Original issue: http://code.google.com/p/duke/issues/detail?id=36

larsga · Feb 15 '14 09:02

From [email protected] on November 04, 2011 03:15:28

Labels: Component-Core

larsga · Feb 15 '14 09:02

From [email protected] on June 05, 2013 08:45:47

Any idea on how you would map the functionality to the map/reduce programming model?

Other than that, I can see a big problem when trying to do quick lookups in data stored in HDFS; as far as I know, Lucene's support for files in HDFS is not really that good yet.

That said, it would be amazing to be able to use Duke on a Hadoop cluster, as the deduplication problem is even trickier in really big datasets.

larsga · Feb 15 '14 09:02

From [email protected] on June 05, 2013 08:52:23

Basically, what you'd have to do is to use a blocking scheme. That is, create a key from each record such that similar records have the same key. Then the mapper goes Record -> (key, Record), and the reducer goes (key, [Record1, Record2, Record3]) -> matching record pairs.

I'm thinking of doing this, but need to review the research literature on creating blocking keys automatically first. Right now, I'm focusing elsewhere.

larsga · Feb 15 '14 09:02
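A rough sketch of the blocking scheme described above, using the classic Hadoop MapReduce API (an editor's illustration, not Duke's actual code): `blockingKey()` and `similarity()` are hypothetical placeholders standing in for Duke's key functions and per-property comparators, and the 0.8 threshold is an assumed value.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class BlockingDedup {

    // Map step: Record -> (blocking key, Record), so that similar records
    // end up in the same reduce call.
    public static class BlockingMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(blockingKey(record.toString())), record);
        }
    }

    // Reduce step: (key, [Record1, Record2, ...]) -> matching record pairs,
    // found by comparing every pair of records that share the blocking key.
    public static class MatchingReducer
            extends Reducer<Text, Text, Text, Text> {
        private static final double THRESHOLD = 0.8; // assumed match threshold

        @Override
        protected void reduce(Text key, Iterable<Text> records, Context context)
                throws IOException, InterruptedException {
            List<String> block = new ArrayList<>();
            for (Text r : records)
                block.add(r.toString());

            for (int i = 0; i < block.size(); i++)
                for (int j = i + 1; j < block.size(); j++)
                    if (similarity(block.get(i), block.get(j)) >= THRESHOLD)
                        context.write(new Text(block.get(i)),
                                      new Text(block.get(j)));
        }
    }

    // Placeholder: a real job would derive the key from the record's
    // properties, e.g. a phonetic key of the name plus the postcode.
    static String blockingKey(String record) {
        return record.length() >= 4
                ? record.substring(0, 4).toLowerCase()
                : record.toLowerCase();
    }

    // Placeholder: a real job would combine Duke's per-property comparators.
    static double similarity(String a, String b) {
        return a.equals(b) ? 1.0 : 0.0;
    }
}
```

The blocking key controls the trade-off: a coarser key catches more true matches, but it also makes each block larger, and the pairwise comparison inside a block grows quadratically with its size.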

@larsga is there any ongoing effort to make Duke work on MR/Cascading/Cascalog/Spark?

anujsrc · Oct 13 '14 12:10

No, not at the moment. It should be pretty easy to do, though. Mainly I need a data set big enough to require MapReduce. Without that there's not much point in working on it.

larsga · Oct 13 '14 12:10

Is there any gain in using M/R for dedup?

YannBrrd · Oct 13 '14 12:10

For smaller dedup tasks: no.

I have seen a paper that claims it doesn't scale so well with M/R, but I was deeply unconvinced by that paper. Having said that, I don't know for certain that it will scale with M/R, but I have a hard time seeing why it wouldn't.

larsga · Oct 13 '14 12:10

I really think it just won't distribute across nodes...

YannBrrd · Oct 13 '14 12:10

No problem there. Just use the blocking functions already used by the MapDB and other blocking backends. Then the map step is record -> blocking key, and finally the reduce step is just matching all the records with the same key against one another.

larsga · Oct 13 '14 12:10
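Since Spark was asked about above, the same structure translates roughly as follows (an editor's sketch against the Spark 2.x Java API; `blockingKey()` and `similarity()` are the same hypothetical placeholders as in the MapReduce sketch, and the input/output paths come from the command line):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkBlockingDedup {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("duke-blocking-sketch");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> records = sc.textFile(args[0]);

            // Map step: record -> (blocking key, record).
            JavaPairRDD<String, String> keyed =
                    records.mapToPair(r -> new Tuple2<>(blockingKey(r), r));

            // Reduce step: group by key, then compare every pair within a block.
            JavaRDD<String> matches = keyed.groupByKey().flatMap(block -> {
                List<String> recs = new ArrayList<>();
                block._2().forEach(recs::add);

                List<String> out = new ArrayList<>();
                for (int i = 0; i < recs.size(); i++)
                    for (int j = i + 1; j < recs.size(); j++)
                        if (similarity(recs.get(i), recs.get(j)) >= 0.8)
                            out.add(recs.get(i) + " <-> " + recs.get(j));
                return out.iterator();
            });

            matches.saveAsTextFile(args[1]);
        }
    }

    // Same hypothetical placeholders as in the MapReduce sketch above.
    static String blockingKey(String record) {
        return record.length() >= 4
                ? record.substring(0, 4).toLowerCase()
                : record.toLowerCase();
    }

    static double similarity(String a, String b) {
        return a.equals(b) ? 1.0 : 0.0;
    }
}
```

With `groupByKey()` every block is materialized for a single task, so, as with the MapReduce version, keeping blocks small is what makes this scale.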