Different results on InMemoryDatabase vs LuceneDatabase

Open mongoose54 opened this issue 9 years ago • 3 comments

I have a simple CSV file containing 5 records with the following fields: last name, first name, driver's license number and city. Two of the records are identical except for the last name: "THOMAS" vs "THOMAAS".

I also have a configuration file (XML) with the comparators and cleaners set up. When I set the database to LuceneDatabase in the configuration file I get no results. When I switch to InMemoryDatabase I get results. I noticed that findCandidateMatches() on LuceneDatabase returns zero results. How can I fix this? Is this a bug?

mongoose54, Mar 16 '15 21:03

One possibility is that you're using the 1.2 release and have a field with a high probability of 1.0. In that case there's a bug in the boosting computation that makes this happen. Switching to 0.99 solves the problem. (If that's it.)

Another possibility is that you're not using fuzzy search, so the Lucene search doesn't find THOMAS from THOMAAS (and vice versa).

A third possibility is that the lookup properties chosen by Duke are not the right ones. Try running the command-line tool with --lookups to see which properties Duke searches on.
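
To illustrate the second possibility, here is a minimal plain-Java sketch of why an exact lookup misses "THOMAAS" when searching for "THOMAS", while a lookup that tolerates one edit would find it. This is not Duke or Lucene code: the class name `FuzzyLookupSketch`, its methods, and the `maxEdits` parameter are all hypothetical and only demonstrate the idea of edit-distance matching.

```java
public class FuzzyLookupSketch {

    // Classic Levenshtein edit distance using two rolling rows.
    public static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                        prev[j - 1] + cost);                    // substitute or keep
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // An exact lookup: the two values must be identical.
    public static boolean exactMatch(String a, String b) {
        return a.equals(b);
    }

    // A fuzzy lookup (hypothetical): accept values within maxEdits edits.
    public static boolean fuzzyMatch(String a, String b, int maxEdits) {
        return editDistance(a, b) <= maxEdits;
    }

    public static void main(String[] args) {
        String query = "THOMAS";
        String indexed = "THOMAAS";
        System.out.println("edit distance: " + editDistance(query, indexed));
        System.out.println("exact match:   " + exactMatch(query, indexed));
        System.out.println("fuzzy match:   " + fuzzyMatch(query, indexed, 1));
    }
}
```

The two last names differ by a single inserted character, so an exact lookup never returns the other record as a candidate, and the comparators never get a chance to score the pair.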

larsga, Mar 17 '15 09:03

@larsga Thanks for the reply and the suggestions. I tried all three, but the problem persists. Could you help check it out if I sent you the files (the sample data and the configuration file)? I would really appreciate it. Thanks.

mongoose54, Mar 18 '15 02:03

Sure, just email it to me at larsga at garshol.priv.no

larsga, Mar 18 '15 07:03