
Compilation errors with Master

Open nmadhire opened this issue 10 years ago • 15 comments

Hi,

I just checked out the master branch to make some changes to the Duke code, and I see a lot of compilation errors.

Below is one from ConfigLoader.java: "Error - The value of customcomparators is not used"

Does the master branch contain the code for version 1.2?

Please let me know.

Thanks.

nmadhire avatar May 14 '15 16:05 nmadhire

Do I need to have some settings in my IDE to ignore such errors?

nmadhire avatar May 14 '15 16:05 nmadhire

Sorry, I figured it out. These are issues with my IDE. I just changed the compiler settings to ignore unnecessary code. We are starting to use Duke for our record matching. I will let everyone know if I see any issues.

Thanks.

nmadhire avatar May 14 '15 16:05 nmadhire

Good to see it resolved itself. Please let us know if you have more concerns/questions.

larsga avatar May 14 '15 18:05 larsga

One question: does the master branch contain the code for version 1.2? It looks like the master branch has many changes compared to 1.2.

nmadhire avatar May 14 '15 19:05 nmadhire

One more :) What is the use of this code in Processor.java?

// start with source 1
for (Collection<Record> batch : makeBatches(sources1, batch_size)) {
  index(1, batch);
  if (hasTwoDatabases())
    linkBatch(2, batch, matchall);
}

// then source 2
for (Collection<Record> batch : makeBatches(sources2, batch_size)) {
  if (hasTwoDatabases())
    index(2, batch);
  linkBatch(1, batch, matchall);
}

So does this match the 1st source against the 2nd, and then match the 2nd against the 1st again? Wouldn't we be doing the matching twice?

My aim is to find matches across two files and create a file with the matched records. With the above code I would get duplicate records if the matching is done twice.

Is my understanding correct?

nmadhire avatar May 14 '15 19:05 nmadhire

One thing I noticed is that for CSVDataSource the input data has to be in double quotes and separated by a delimiter (which can be configured in the config file).

Is it necessary to have the input data in quotes? Is there anything we can change to accept unquoted data as well?

Thanks. I apologize for asking too many questions. :)

nmadhire avatar May 14 '15 20:05 nmadhire

Master branch: the code on this branch is on the way to the 2.0 release. The 1.2 code is on the 1.2 tag.

Processor code: What the method does is explained in the javadoc comment. It does record linkage of the two groups in the configuration. It's explained in the documentation. If you only have one database (which is the norm), then it will first index up all records in group 1, then match all records in group 2 against those in the index. That sounds like what you want.
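As a rough illustration of that index-then-match flow, here is a toy sketch. It is not Duke's actual code: `LinkageSketch`, `Rec`, and `link` are made-up names, and exact key equality stands in for Duke's probabilistic comparison.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LinkageSketch {
    // Hypothetical stand-in for a record: an id plus a single key to match on.
    static final class Rec {
        final String id;
        final String key;
        Rec(String id, String key) { this.id = id; this.key = key; }
    }

    // Index every record in group 1 once, then look up each record of
    // group 2 in that index. Each cross-group pair is reported once.
    public static List<String> link(List<Rec> group1, List<Rec> group2) {
        Map<String, List<Rec>> index = new HashMap<>();
        for (Rec r : group1) {
            index.computeIfAbsent(r.key, k -> new ArrayList<>()).add(r);
        }
        List<String> matches = new ArrayList<>();
        for (Rec r : group2) {
            for (Rec candidate : index.getOrDefault(r.key, Collections.<Rec>emptyList())) {
                matches.add(candidate.id + "=" + r.id);
            }
        }
        return matches;
    }
}
```

Group 2 is never indexed here, so nothing is matched back in the other direction and no pair shows up twice.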

CSV: you only need quotes around the values if you have commas or line breaks in the values.
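To illustrate when quoting is needed, here is a minimal sketch following the usual RFC 4180 convention. It is not Duke's own code: `CsvQuoting` and `quoteIfNeeded` are hypothetical names, a comma separator is assumed, and doubling embedded quotes is an assumption taken from RFC 4180 rather than something confirmed for Duke's parser.

```java
class CsvQuoting {
    // Returns the value unchanged unless it contains a comma, a double
    // quote, or a line break; in those cases it is wrapped in double
    // quotes and embedded quotes are doubled (assumed RFC 4180 style).
    public static String quoteIfNeeded(String value) {
        boolean needsQuotes = value.indexOf(',') >= 0
                || value.indexOf('"') >= 0
                || value.indexOf('\n') >= 0
                || value.indexOf('\r') >= 0;
        if (!needsQuotes) {
            return value;
        }
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }
}
```

With this helper a plain value such as `Oslo` is written as-is, while `Smith, John` is written with surrounding quotes.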

larsga avatar May 15 '15 08:05 larsga

Thanks @larsga. I was using version 1.2 before and had to put double quotes around all the fields for Duke to recognize the values. With master I don't have to use double quotes anymore.

Also, how can I trigger the multi-threaded logic? I don't see that logic anymore in Processor.java, or am I missing something?

nmadhire avatar May 15 '15 16:05 nmadhire

Also, are there any tips or documentation on how to set the probabilities for different columns? We are seeing a lot of unusual matches with record linkage.

nmadhire avatar May 15 '15 19:05 nmadhire

Also, can you tell me how I can enable the multi-threaded logic? I don't see it in the latest version.

nmadhire avatar May 19 '15 14:05 nmadhire

Multi-threaded logic: Just use the --threads option on the command-line, or setThreads on Processor. It's there.

The tuning guide has guidance on setting the probabilities, including (at the bottom) references to how you can have Duke learn the probabilities.

larsga avatar May 20 '15 10:05 larsga

It looks like the multi-threaded logic is used for deduplication only and is not being triggered for record linkage.

nmadhire avatar May 20 '15 14:05 nmadhire

It's possible. Please make an issue for it, and I will check as soon as I have time.

larsga avatar May 20 '15 14:05 larsga

Sure, I will do that. I modified the code in Processor.java.

I commented out the for loop and used the match method that has the multi-threaded logic.

This now triggers multi-threading for record linkage as well.

private void linkBatch(int dbno, Collection<Record> batch, boolean matchall) {
  batchReady(batch.size());
  //for (Record r : batch)
  //  match(dbno, r, matchall);
  match(batch, matchall);
  batchDone();
}
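For readers wondering what matching a whole batch across threads can look like in general, here is a self-contained sketch. It is not Duke's implementation of match: `ParallelBatchSketch` and `matchInParallel` are hypothetical names, and the per-record work is passed in as a callback that is assumed to be thread-safe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

class ParallelBatchSketch {
    // Splits the batch into roughly equal slices, runs matchOne on each
    // slice in its own task, and waits until every slice has finished.
    // threads must be at least 1.
    public static <T> void matchInParallel(List<T> batch, int threads, Consumer<T> matchOne) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunk = (batch.size() + threads - 1) / threads; // ceiling division
            List<Future<?>> pending = new ArrayList<>();
            for (int start = 0; start < batch.size(); start += chunk) {
                List<T> slice = batch.subList(start, Math.min(start + chunk, batch.size()));
                pending.add(pool.submit(() -> slice.forEach(matchOne)));
            }
            for (Future<?> f : pending) {
                f.get(); // blocks until the slice is done; surfaces task failures
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while matching", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("matching task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Waiting on every future before returning keeps the batch boundary intact, so any "batch done" bookkeeping only runs after all records in the batch have been processed.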

nmadhire avatar May 20 '15 15:05 nmadhire

Record linkage is working fine after the above change in the code.

nmadhire avatar May 20 '15 15:05 nmadhire