Duke
Support for Hadoop processing
From [email protected] on September 04, 2011 15:25:27
Longer-term we should be able to farm out processing work to Hadoop clusters.
Original issue: http://code.google.com/p/duke/issues/detail?id=36
From [email protected] on June 05, 2013 08:45:47
Any idea how you would map the functionality to the map/reduce programming model?
Other than this, I can see a big problem with doing quick lookups in data stored in HDFS; as far as I know, Lucene's support for files in HDFS is not really that good yet.
That said, it would be amazing to be able to use Duke in a Hadoop cluster, as the deduplication problem is even trickier on really big datasets.
From [email protected] on June 05, 2013 08:52:23
Basically, what you'd have to do is to use a blocking scheme. That is, create a key from each record such that similar records have the same key. Then the mapper goes Record -> (key, Record), and the reducer goes (key, [Record1, Record2, Record3]) -> matching record pairs.
I'm thinking of doing this, but need to review the research literature on creating blocking keys automatically first. Right now, I'm focusing elsewhere.
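To make that mapping concrete, here is a minimal Hadoop MapReduce sketch of the blocking scheme described above. The `makeBlockingKey` and `matches` helpers are hypothetical placeholders for whatever key function and probabilistic comparator Duke would actually supply; a real job would also need a driver that wires in input/output formats.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class BlockingDedup {

  /** Map step: Record -> (blocking key, Record). */
  public static class BlockingMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object offset, Text record, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new Text(makeBlockingKey(record.toString())), record);
    }
  }

  /** Reduce step: (key, [Record1, Record2, ...]) -> matching record pairs. */
  public static class MatchingReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> records, Context ctx)
        throws IOException, InterruptedException {
      // Hadoop reuses the Text object across iterations, so copy
      // each record into a buffer before comparing pairs.
      List<String> block = new ArrayList<String>();
      for (Text r : records)
        block.add(r.toString());

      // Compare all pairs within the block.
      for (int i = 0; i < block.size(); i++)
        for (int j = i + 1; j < block.size(); j++)
          if (matches(block.get(i), block.get(j)))
            ctx.write(new Text(block.get(i)), new Text(block.get(j)));
    }
  }

  // Hypothetical key function: similar records must map to the same key.
  static String makeBlockingKey(String record) {
    return record.length() < 3 ? record : record.toLowerCase().substring(0, 3);
  }

  // Placeholder for Duke's probabilistic record comparison.
  static boolean matches(String a, String b) {
    return a.equalsIgnoreCase(b);
  }
}
```

Note that the quality of the blocking key decides everything here: it bounds both recall (pairs in different blocks are never compared) and the cost of the quadratic comparison inside each block.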
@larsga is there any ongoing effort to make Duke work on MR/Cascading/Cascalog/Spark?
No, not at the moment. It should be pretty easy to do, though. Mainly I need a data set big enough to require MapReduce. Without that there's not much point in working on it.
Is there any gain to using M/R for dedup?
For smaller dedup tasks: no.
I have seen a paper that claims it doesn't scale so well with M/R, but I was deeply unconvinced by that paper. Having said that, I don't know that it really will scale with M/R, but I have a hard time seeing why not.
I really think it just won't distribute across nodes...
No problem there. Just use the blocking functions already used by the MapDB and other blocking backends. Then the map step is record -> (blocking key, record), and the reduce step is just matching all the records with the same key against one another.
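As a sketch of that reuse, a mapper could delegate key generation to one of those existing blocking functions. The `KeyFunction` interface below is a stand-in; the exact name and signature of the functions backing the MapDB database may differ.

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class KeyFunctionMapper extends Mapper<Object, Text, Text, Text> {

  // Stand-in for the blocking key functions used by the MapDB
  // backend; the real Duke interface may look different.
  interface KeyFunction {
    String makeKey(String record);
  }

  // Example key function: first token of the record, lowercased.
  private final KeyFunction keyfunc =
      record -> record.split("\\s+")[0].toLowerCase();

  @Override
  protected void map(Object offset, Text record, Context ctx)
      throws IOException, InterruptedException {
    // Emit (blocking key, record); the reducer then matches all
    // records sharing a key against one another, as sketched earlier.
    ctx.write(new Text(keyfunc.makeKey(record.toString())), record);
  }
}
```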