Duke
Different results on InMemoryDatabase vs LuceneDatabase
I have a simple CSV file containing 5 records with the following fields: last name, first name, driver's license number and city. Two of the records are identical except for the last name: "THOMAS" vs "THOMAAS".
I also have a configuration file (XML) with the comparators and cleaners set up. When I set the database to LuceneDatabase in the configuration file I get no results. When I switch to InMemoryDatabase I get results. I noticed that findCandidateMatches() on LuceneDatabase returns zero results. How can I fix this? Is this a bug?
One possibility is that you're using the 1.2 release and have a field with a high probability of 1.0. In that case there's a bug in the boosting computation that makes this happen. Switching to 0.99 solves the problem. (If that's it.)
Another possibility is that you're not using fuzzy search, so the Lucene search doesn't find THOMAS from THOMAAS (and vice versa).
A third possibility is that the lookup properties chosen by Duke are not the right ones. Try running the command-line tool with --lookups to see which properties Duke searches on.
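To illustrate the second possibility (no fuzzy search), here is a minimal, self-contained sketch of why an exact term lookup misses THOMAAS when THOMAS is indexed, while a lookup that tolerates one edit finds it. This is not Duke's or Lucene's actual code: the class FuzzyLookupSketch and its methods are hypothetical names made up for this example, and the edit-distance threshold of 1 is an assumption chosen to fit the two names in the question.

```java
// Hypothetical illustration only; not part of Duke or Lucene.
public class FuzzyLookupSketch {

    // Levenshtein edit distance: each insert, delete or substitute costs 1.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Exact term lookup: the query must equal the indexed value.
    static boolean exactMatch(String indexed, String query) {
        return indexed.equals(query);
    }

    // Fuzzy lookup: accept values within maxEdits edits of the query.
    static boolean fuzzyMatch(String indexed, String query, int maxEdits) {
        return editDistance(indexed, query) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(exactMatch("THOMAS", "THOMAAS"));    // prints false
        System.out.println(fuzzyMatch("THOMAS", "THOMAAS", 1)); // prints true
    }
}
```

With exact lookup the candidate search returns nothing for this pair, so no comparison ever happens; a database that compares every record against every other one does not depend on the lookup step, which would be consistent with InMemoryDatabase returning results.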
@larsga Thanks for the reply and the suggestions. I tried all three, but the problem persists. Could you help check it out if I sent you the files (sample data and configuration)? I would really appreciate it. Thanks.
Sure, just email it to me at larsga at garshol.priv.no