Duke Compilation errors with Master

Hi,

I just checked out the master branch to make some changes to the Duke code, and I see a lot of compilation errors.

Below is one from ConfigLoader.java: "Error - The value of customcomparators is not used"

Does the master branch have code related to version 1.2?

Please let me know.

Thanks.

May 14 '15 16:05 nmadhire

Do I need to have some settings in my IDE to ignore such errors?

May 14 '15 16:05 nmadhire

Sorry, I figured it out. These are issues with my IDE. I just changed the compiler settings to ignore unnecessary-code checks. We are starting to use Duke for our record matching. I will let everyone know if I see any issues.

Thanks.

May 14 '15 16:05 nmadhire

Good to see it resolved itself. Please let us know if you have more concerns/questions.

May 14 '15 18:05 larsga

One question: does the master branch have code related to version 1.2? It looks like there are many changes since version 1.2 in the master branch.

May 14 '15 19:05 nmadhire

One more :) What is the purpose of this code in Processor.java?

// start with source 1
for (Collection<Record> batch : makeBatches(sources1, batch_size)) {
  index(1, batch);
  if (hasTwoDatabases())
    linkBatch(2, batch, matchall);
}

// then source 2
for (Collection<Record> batch : makeBatches(sources2, batch_size)) {
  if (hasTwoDatabases())
    index(2, batch);
  linkBatch(1, batch, matchall);
}

So is this matching the first source against the second, and then matching the second against the first again? Wouldn't we be doing the matching twice?

My aim is to find matches across two files and create a file of matched records. With the above code I would get duplicate records if the matching is done twice.

Is my understanding correct?

May 14 '15 19:05 nmadhire

One thing I noticed is that for CSVDataSource the input data has to be in double quotes and separated by a delimiter (which can be configured in the config file).

Is it necessary to have the input data in quotes? Is there anything we can change to accept unquoted data as well?

Thanks. I apologize for asking too many questions. :)

May 14 '15 20:05 nmadhire

Master branch: the code on this branch is on the way to the 2.0 release. The 1.2 code is on the 1.2 tag.

Processor code: What the method does is explained in the javadoc comment. It does record linkage of the two groups in the configuration. It's explained in the documentation. If you only have one database (which is the norm), then it will first index up all records in group 1, then match all records in group 2 against those in the index. That sounds like what you want.

CSV: you only need quotes around the values if you have commas or line breaks in the values.

May 15 '15 08:05 larsga
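
To illustrate the index-then-match flow described above, here is a minimal, self-contained sketch. It is not Duke's code: the class `LinkageSketch`, the `key` normalizer, and the exact-match lookup are hypothetical stand-ins for Duke's index and comparators. It only shows why each matching pair is reported once when group 1 is indexed first and group 2 is then matched against that index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class LinkageSketch {
    // Hypothetical stand-in for real comparators: exact match on a normalized key.
    static String key(String value) {
        return value.trim().toLowerCase(Locale.ROOT);
    }

    // Step 1: index every record in group 1.
    // Step 2: look up each record in group 2 against that index.
    static List<String[]> link(List<String> group1, List<String> group2) {
        Map<String, List<String>> index = new HashMap<>();
        for (String record : group1) {
            index.computeIfAbsent(key(record), k -> new ArrayList<>()).add(record);
        }
        List<String[]> matches = new ArrayList<>();
        for (String record : group2) {
            for (String candidate : index.getOrDefault(key(record), List.of())) {
                matches.add(new String[] {candidate, record});
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String[]> pairs = link(List.of("Oslo", "Bergen"), List.of("oslo ", "Trondheim"));
        for (String[] pair : pairs) {
            System.out.println(pair[0] + " <-> " + pair[1]);
        }
    }
}
```

Group 2 is never indexed here, so no pair can be found a second time from the other direction.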

Thanks @larsga. I was using version 1.2 before and had to put double quotes around all the fields for Duke to recognize the values. But with master I don't have to use double quotes anymore.

Also, how can I trigger the multi-threaded logic? I don't see that logic anymore in Processor.java, or am I missing something?

May 15 '15 16:05 nmadhire
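
On the CSV quoting point: the rule that quotes are only needed around values containing the separator or a line break can be sketched with a tiny parser. This is an illustrative sketch, not Duke's CSVDataSource; the class `CsvQuoteSketch` and the method `parseLine` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

public class CsvQuoteSketch {
    // Splits one CSV record into fields. Unquoted values are taken as-is;
    // quotes are only required to protect separators, line breaks, or quotes.
    static List<String> parseLine(String line, char sep) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote is a literal quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // separators and newlines are kept
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == sep) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parseLine("1,Oslo,Norway", ','));
        System.out.println(parseLine("2,\"Smith, John\",UK", ','));
    }
}
```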

Also, are there any tips or documentation on how to set probabilities for different columns? We are seeing a lot of unusual matches with record linkage.

May 15 '15 19:05 nmadhire

Also, can you tell me how I can use the multi-threaded logic? I don't see it in the latest version.

May 19 '15 14:05 nmadhire

Multi-threaded logic: Just use the --threads option on the command-line, or setThreads on Processor. It's there.

The tuning guide has guidance on setting the probabilities, including (at the bottom) references to how you can have Duke learn the probabilities.

May 20 '15 10:05 larsga
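
For intuition on why the per-column probabilities matter, here is a small sketch of a naive Bayes combination of two probabilities. That Duke combines per-property probabilities this way is an assumption here, to be checked against the tuning guide; the class `BayesSketch` and the method `combine` are hypothetical names, not Duke's API.

```java
public class BayesSketch {
    // Naive Bayes combination of two independent match probabilities.
    // 0.5 is neutral; values above 0.5 push the result up, values below pull it down.
    static double combine(double p1, double p2) {
        return (p1 * p2) / ((p1 * p2) + (1.0 - p1) * (1.0 - p2));
    }

    public static void main(String[] args) {
        // Two moderately strong agreements reinforce each other.
        System.out.println(combine(0.9, 0.9));
        // A strong disagreement on one column can cancel an agreement on another.
        System.out.println(combine(0.9, 0.1));
    }
}
```

Under this assumption, a column whose high probability is set too generously can make two records match on that column alone, which is one possible source of unexpected matches.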

It looks like the multi-threaded logic is used for deduplication only and is not being triggered for record linkage.

May 20 '15 14:05 nmadhire

It's possible. Please make an issue for it, and I will check as soon as I have time.

May 20 '15 14:05 larsga

Sure, I will. I modified the code in Processor.java: I commented out the for loop and used the match method that has the multi-threaded logic.

This now triggers multi-threading for record linkage as well.

private void linkBatch(int dbno, Collection<Record> batch, boolean matchall) {
  batchReady(batch.size());
  //for (Record r : batch)
  //  match(dbno, r, matchall);
  match(batch, matchall);
  batchDone();
}

May 20 '15 15:05 nmadhire

Record linkage is working fine after the above change in the code.

May 20 '15 15:05 nmadhire
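
The general idea behind handing a whole batch to a multi-threaded matcher, rather than matching record by record, can be sketched as follows. This is not Duke's implementation: `ParallelBatchSketch` and `matchBatch` are hypothetical names, and a simple set lookup stands in for the real comparison. The batch is split into one slice per thread, and each slice is matched against a shared read-only index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelBatchSketch {
    // Matches a batch against a read-only index using a fixed number of threads.
    // Results are collected in slice order, so output order follows batch order.
    static List<String> matchBatch(Set<String> index, List<String> batch, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunk = (batch.size() + threads - 1) / threads; // ceiling division
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int start = 0; start < batch.size(); start += chunk) {
                List<String> slice = batch.subList(start, Math.min(start + chunk, batch.size()));
                futures.add(pool.submit(() -> {
                    List<String> hits = new ArrayList<>();
                    for (String record : slice) {
                        if (index.contains(record)) {
                            hits.add(record);
                        }
                    }
                    return hits;
                }));
            }
            List<String> all = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                all.addAll(future.get()); // wait for each slice to finish
            }
            return all;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("batch matching failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(matchBatch(Set.of("a", "c"), List.of("a", "b", "c", "d"), 2));
    }
}
```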
