A few questions about Duke.

Hi all, I am new to Duke and I have a few questions:

  1. Are there any benchmarks available that show the accuracy of Duke?
  2. How fast is Duke at matching a single record against a large (2M-record) database?
  3. Is it possible to extend the matching algorithms with machine-learning approaches such as neural nets, support vector machines, etc.? Which part of the code should I focus on in order to add such capabilities?
  4. I would like Duke to use nicknames when comparing people's names. Do I just have to update the text file "no/priv/garshol/duke/name-mappings.txt" with all the nickname mappings? (See the configuration sketch after this list for where a name comparator plugs in.)
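
For context, here is a minimal sketch of a Duke XML configuration showing where a name comparator attaches to a property. This is a hedged example: the file name, property names, and threshold values are invented for illustration, and it assumes that `no.priv.garshol.duke.comparators.PersonNameComparator` is the comparator that makes use of the packaged name mappings.

```xml
<!-- Hypothetical configuration: file name, property names and
     threshold values are illustrative only. -->
<duke>
  <schema>
    <threshold>0.85</threshold>
    <property type="id">
      <name>ID</name>
    </property>
    <property>
      <name>NAME</name>
      <!-- assumption: the comparator that applies the name mappings -->
      <comparator>no.priv.garshol.duke.comparators.PersonNameComparator</comparator>
      <low>0.3</low>
      <high>0.9</high>
    </property>
  </schema>

  <csv>
    <param name="input-file" value="people.csv"/>
    <column name="id" property="ID"/>
    <column name="name" property="NAME"
            cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
  </csv>
</duke>
```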

mongoose54 avatar Nov 20 '14 17:11 mongoose54

Hi there,

  1. I could put up some benchmarks, but IMHO they would be useless. Accuracy varies with the data available and the amount of noise in the data.
  2. That's hard to answer, as it depends on the data, the comparators you use, the database backend, how you've configured it, and the hardware. I've done 1.6M records in less than 10 minutes; at that rate, 600 seconds / 1,600,000 records ≈ 0.000375 seconds per record.
  3. Yes. It depends on what level you want the algorithm to apply at. If you intend to replace the Bayesian model, I'd look at the Processor class. If you want to work at the comparator level, you can just plug in a new comparator implementation (see the sketch after this list).
  4. Yes. I'm not sure this is actually a good idea, but you can try it.
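
To illustrate the comparator-level option in point 3, here is a minimal sketch of a custom comparator, assuming Duke's `no.priv.garshol.duke.Comparator` interface (`isTokenized()` plus `compare(String, String)` returning a similarity between 0.0 and 1.0). The class name and the bigram-Jaccard scoring are invented stand-ins; in practice the scoring body is where a trained model (SVM, neural net, etc.) would be called.

```java
import java.util.HashSet;
import java.util.Set;

import no.priv.garshol.duke.Comparator;

// Sketch of a comparator-level extension. The character-bigram Jaccard
// similarity below is only a stand-in for wherever a trained model would
// produce its score; Duke just needs a value in the 0.0-1.0 range.
public class ModelScoreComparator implements Comparator {

  public boolean isTokenized() {
    return true; // values can meaningfully be compared token by token
  }

  public double compare(String v1, String v2) {
    if (v1.equals(v2))
      return 1.0;
    // replace this with a call to your trained model
    return jaccard(bigrams(v1), bigrams(v2));
  }

  // collect the set of two-character substrings of s
  private static Set<String> bigrams(String s) {
    Set<String> grams = new HashSet<String>();
    for (int ix = 0; ix < s.length() - 1; ix++)
      grams.add(s.substring(ix, ix + 2));
    return grams;
  }

  // |intersection| / |union|, defined as 0.0 when both sets are empty
  private static double jaccard(Set<String> a, Set<String> b) {
    Set<String> union = new HashSet<String>(a);
    union.addAll(b);
    if (union.isEmpty())
      return 0.0; // unequal empty or one-character values
    Set<String> inter = new HashSet<String>(a);
    inter.retainAll(b);
    return (double) inter.size() / union.size();
  }
}
```

As far as I can tell, Duke instantiates comparators by class name from the configuration, so a custom class on the classpath should plug into a property the same way the built-in comparators do.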

larsga avatar Nov 21 '14 08:11 larsga

We ran Duke on our database of 550k records, and it found 60k dupes in under 5 minutes.

swamikevala avatar Nov 21 '14 09:11 swamikevala

Thank you both for the responses. I appreciate it.

@larsga Regarding (2), what kind of database backend did you use to get those great numbers? And regarding (4), you mentioned that it might not be a good idea. How would you do it instead?

mongoose54 avatar Nov 21 '14 21:11 mongoose54