Support for Hadoop processing

Open larsga opened this issue 11 years ago • 9 comments

From [email protected] on September 04, 2011 15:25:27

Longer-term we should be able to farm out processing work to Hadoop clusters.

Original issue: http://code.google.com/p/duke/issues/detail?id=36

larsga · Feb 15 '14 09:02

From [email protected] on November 04, 2011 03:15:28

Labels: Component-Core

larsga · Feb 15 '14 09:02

From [email protected] on June 05, 2013 08:45:47

Any idea on how you would map the functionality to the map/reduce programming model?

Other than that, I can see a big problem when trying to do quick lookups in data stored in HDFS; as far as I know, Lucene's support for files in HDFS is not really that good yet.

That said, it would be amazing to be able to use Duke on a Hadoop cluster, as the deduplication problem is even trickier in really big datasets.

larsga · Feb 15 '14 09:02

From [email protected] on June 05, 2013 08:52:23

Basically, what you'd have to do is to use a blocking scheme. That is, create a key from each record such that similar records have the same key. Then the mapper goes Record -> (key, Record), and the reducer goes (key, [Record1, Record2, Record3]) -> matching record pairs.

I'm thinking of doing this, but need to review the research literature on creating blocking keys automatically first. Right now, I'm focusing elsewhere.

larsga · Feb 15 '14 09:02
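A rough sketch of the blocking scheme described above, using the classic Hadoop MapReduce API (an editor's illustration, not Duke's actual code): `blockingKey()` and `similarity()` are hypothetical placeholders standing in for Duke's key functions and per-property comparators, and the 0.8 threshold is an assumed value.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class BlockingDedup {

    // Map step: Record -> (blocking key, Record), so that similar records
    // end up in the same reduce call.
    public static class BlockingMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(blockingKey(record.toString())), record);
        }
    }

    // Reduce step: (key, [Record1, Record2, ...]) -> matching record pairs,
    // found by comparing every pair of records that share the blocking key.
    public static class MatchingReducer
            extends Reducer<Text, Text, Text, Text> {
        private static final double THRESHOLD = 0.8; // assumed match threshold

        @Override
        protected void reduce(Text key, Iterable<Text> records, Context context)
                throws IOException, InterruptedException {
            List<String> block = new ArrayList<>();
            for (Text r : records)
                block.add(r.toString());

            for (int i = 0; i < block.size(); i++)
                for (int j = i + 1; j < block.size(); j++)
                    if (similarity(block.get(i), block.get(j)) >= THRESHOLD)
                        context.write(new Text(block.get(i)),
                                      new Text(block.get(j)));
        }
    }

    // Placeholder: a real job would derive the key from the record's
    // properties, e.g. a phonetic key of the name plus the postcode.
    static String blockingKey(String record) {
        return record.length() >= 4
                ? record.substring(0, 4).toLowerCase()
                : record.toLowerCase();
    }

    // Placeholder: a real job would combine Duke's per-property comparators.
    static double similarity(String a, String b) {
        return a.equals(b) ? 1.0 : 0.0;
    }
}
```

The blocking key controls the trade-off: a coarser key catches more true matches, but it also makes each block larger, and the pairwise comparison inside a block grows quadratically with its size.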

@larsga is there any ongoing effort to make Duke work on MR/Cascading/Cascalog/Spark?

anujsrc · Oct 13 '14 12:10

No, not at the moment. It should be pretty easy to do, though. Mainly I need a data set big enough to require MapReduce. Without that there's not much point in working on it.

larsga · Oct 13 '14 12:10

Is there any gain in using M/R for dedup?

YannBrrd · Oct 13 '14 12:10

For smaller dedup tasks: no.

I have seen a paper that claims it doesn't scale so well with M/R, but I was deeply unconvinced by that paper. Having said that, I don't know for certain that it will scale with M/R, but I have a hard time seeing why it wouldn't.

larsga · Oct 13 '14 12:10

I really think it just won't distribute across nodes...

YannBrrd · Oct 13 '14 12:10

No problem there. Just use the blocking functions already used by the MapDB and other blocking backends. Then the map step is record -> blocking key, and finally the reduce step is just matching all the records with the same key against one another.

larsga · Oct 13 '14 12:10
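Since Spark was asked about above, the same structure translates roughly as follows (an editor's sketch against the Spark 2.x Java API; `blockingKey()` and `similarity()` are the same hypothetical placeholders as in the MapReduce sketch, and the input/output paths come from the command line):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkBlockingDedup {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("duke-blocking-sketch");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> records = sc.textFile(args[0]);

            // Map step: record -> (blocking key, record).
            JavaPairRDD<String, String> keyed =
                    records.mapToPair(r -> new Tuple2<>(blockingKey(r), r));

            // Reduce step: group by key, then compare every pair within a block.
            JavaRDD<String> matches = keyed.groupByKey().flatMap(block -> {
                List<String> recs = new ArrayList<>();
                block._2().forEach(recs::add);

                List<String> out = new ArrayList<>();
                for (int i = 0; i < recs.size(); i++)
                    for (int j = i + 1; j < recs.size(); j++)
                        if (similarity(recs.get(i), recs.get(j)) >= 0.8)
                            out.add(recs.get(i) + " <-> " + recs.get(j));
                return out.iterator();
            });

            matches.saveAsTextFile(args[1]);
        }
    }

    // Same hypothetical placeholders as in the MapReduce sketch above.
    static String blockingKey(String record) {
        return record.length() >= 4
                ? record.substring(0, 4).toLowerCase()
                : record.toLowerCase();
    }

    static double similarity(String a, String b) {
        return a.equals(b) ? 1.0 : 0.0;
    }
}
```

With `groupByKey()` every block is materialized for a single task, so, as with the MapReduce version, keeping blocks small is what makes this scale.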