Duke
Compilation errors with Master
Hi,
I just checked out the master branch to make some changes to the Duke code, and I see a lot of compilation errors.
Here is one from ConfigLoader.java: "Error - The value of customcomparators is not used"
Does the master branch contain the code for version 1.2?
Please let me know.
Thanks.
Do I need to have some settings in my IDE to ignore such errors?
Sorry, I figured it out. These are issues with my IDE. I changed the compiler settings to ignore unnecessary code. We are starting to use Duke for our record matching. I will let everyone know if I see any issues.
Thanks.
Good to see it resolved itself. Please let us know if you have more concerns/questions.
One question: does the master branch contain the code for version 1.2? It looks like there are many changes from version 1.2 in the master branch.
One more :) What is the purpose of this code in Processor.java?
    // start with source 1
    for (Collection<Record> batch : makeBatches(sources1, batch_size)) {
      index(1, batch);
      if (hasTwoDatabases())
        linkBatch(2, batch, matchall);
    }

    // then source 2
    for (Collection<Record> batch : makeBatches(sources2, batch_size)) {
      if (hasTwoDatabases())
        index(2, batch);
      linkBatch(1, batch, matchall);
    }
So does this match the first source against the second, and then the second against the first? Wouldn't we be doing the matching twice?
My aim is to find matches across two files and create a file with the matched records. With the above code I will get duplicate records if the matching is done twice.
Is my understanding correct?
One thing I noticed: for CSVDataSource, the input values have to be in double quotes and separated by a delimiter (which can be configured in the config file).
Is it necessary to have the input data in quotes? Is there anything we can change to accept unquoted data as well?
Thanks. I apologize for asking too many questions. :)
Master branch: the code on this branch is on the way to the 2.0 release. The 1.2 code is on the 1.2 tag.
Processor code: What the method does is explained in the javadoc comment. It does record linkage of the two groups in the configuration. It's explained in the documentation. If you only have one database (which is the norm), then it will first index up all records in group 1, then match all records in group 2 against those in the index. That sounds like what you want.
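To make the "index group 1, then match group 2" order concrete, here is a minimal toy sketch. It is not Duke code: the class LinkSketch, the link method, and the exact-match-on-a-normalized-key "index" are all hypothetical stand-ins for Duke's real index and comparators. The point is only that each group 2 record is looked up once against the indexed group 1 records, so each link is reported once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only; names and matching rule are assumptions, not Duke's API.
public class LinkSketch {

    // Step 1: index every record in group 1.
    // Step 2: look up every record in group 2 against that index.
    static List<String> link(List<String> group1, List<String> group2) {
        Map<String, String> index = new HashMap<>();
        for (String record : group1) {
            index.put(normalize(record), record);
        }

        List<String> links = new ArrayList<>();
        for (String record : group2) {
            String hit = index.get(normalize(record));
            if (hit != null) {
                links.add(hit + " <-> " + record);
            }
        }
        return links;
    }

    // Crude stand-in for real comparators: trim and lowercase.
    static String normalize(String s) {
        return s.trim().toLowerCase();
    }

    public static void main(String[] args) {
        List<String> links = link(
            Arrays.asList("Acme Corp", "Globex"),
            Arrays.asList("acme corp ", "Initech"));
        System.out.println(links);
    }
}
```

Because group 2 is never indexed in this sketch, there is no second pass in the other direction, which is why a link does not show up twice.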
CSV: you only need quotes around the values if you have commas or line breaks in the values.
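As an illustration of why quotes are only needed in those cases, here is a small sketch of a CSV field splitter. It is not the reader Duke uses; CsvLineSketch and split are hypothetical names, and treating a doubled quote inside a quoted value as a literal quote is an assumed convention. Unquoted values pass through unchanged; quotes only matter when the value itself contains the separator.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative splitter; not Duke's CSV reader.
public class CsvLineSketch {

    static List<String> split(String line, char sep) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // assumed convention: "" is a literal quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // separators are literal inside quotes
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == sep) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("1,Oslo,\"Smith, John\"", ','));
    }
}
```

Here the first two values need no quotes, while the third does because it contains a comma.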
Thanks @larsga. I was using version 1.2 before and had to put double quotes around all the fields for Duke to recognize the values. With master I don't have to use double quotes anymore.
Also, how can I trigger the multi-threaded logic? I don't see it in Processor.java anymore, or am I missing something?
Also, are there any tips or documentation on how to set probabilities for different columns? We are seeing a lot of unusual matches with record linkage.
Also, can you tell me how I can implement the multi-threaded logic? I don't see it in the latest version.
Multi-threaded logic: Just use the --threads option on the command-line, or setThreads on Processor. It's there.
The tuning guide has guidance on setting the probabilities, including (at the bottom) references to how you can have Duke learn the probabilities.
It looks like the multi-threaded logic is used for deduplication only; it is not being triggered for record linkage.
It's possible. Please make an issue for it, and I will check as soon as I have time.
Sure, I will. I modified the code in Processor.java:
I commented out the for loop and used the match method that has the multi-threaded logic.
This now triggers multi-threading for record linkage as well.
    private void linkBatch(int dbno, Collection<Record> batch, boolean matchall) {
      batchReady(batch.size());
      //for (Record r : batch)
      //  match(dbno, r, matchall);
      match(batch, matchall);
      batchDone();
    }
Record linkage is working fine after the above change in the code.
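For anyone wondering what "matching a batch with several threads" looks like in general, here is a generic sketch. It does not show what Duke's match(batch, matchall) does internally, which is not visible in this thread; ParallelBatchSketch, matchInParallel, and the matchOne callback are hypothetical names. It simply splits a batch into slices and hands each slice to a worker thread, assuming the per-record callback is thread-safe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Generic illustration; not Duke's implementation.
public class ParallelBatchSketch {

    // Runs matchOne on every record, spreading the batch over `threads` workers.
    // Assumes threads >= 1 and that matchOne is safe to call concurrently.
    static <R> void matchInParallel(List<R> batch, int threads, Consumer<R> matchOne) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            int chunk = (batch.size() + threads - 1) / threads; // ceiling division
            for (int start = 0; start < batch.size(); start += chunk) {
                List<R> slice = batch.subList(start, Math.min(start + chunk, batch.size()));
                pending.add(pool.submit(() -> slice.forEach(matchOne)));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for the slice and surface any failure
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        AtomicInteger processed = new AtomicInteger();
        matchInParallel(Arrays.asList("a", "b", "c", "d", "e"), 2,
            record -> processed.incrementAndGet());
        System.out.println("processed " + processed.get() + " records");
    }
}
```

The main caveat with any approach like this is shared state: whatever the callback touches (listeners, counters, the index) has to tolerate concurrent access.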