Feature request: Add merger based on datasource weight

Open larsga opened this issue 11 years ago • 7 comments

From vasilievip on June 12, 2013 14:49:14

Assuming this use-case:

There are several systems that need to be scanned for duplicates, and after deduplication a best record needs to be created so that users can update the underlying systems. By using one datasource per system, one can load rows from each system, but to merge properly, each system may need a different priority for the fields taken from the duplicates into the best record. In addition, the age of each row (its date of last modification) in the underlying system must be taken into account.

Here is some code to look at: https://svn.java.net/svn/mosaic~mdm/trunk/open-dm-mi/index-core/src/main/java/com/sun/mdm/index/survivor/

And some info on the project it's used in: https://www.youtube.com/user/HealthIT2?feature=watch (see the OHMPI-related videos)
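A minimal sketch of the merge rule described in this use case, assuming each record carries the name of its source system and a last-modified date. The class and method names (`WeightedMerger`, `SourceRecord`, `merge`) are hypothetical and are not part of Duke or of the mosaic code linked above. For each field, the value from the source with the highest weight wins, and the newer row breaks ties.

```java
import java.time.LocalDate;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Duke's API: builds a "best record" from a cluster of duplicates.
final class WeightedMerger {

    /** One row loaded from a source system. */
    static final class SourceRecord {
        final String source;               // name of the source system
        final LocalDate lastModified;      // age of the row in that system
        final Map<String, String> fields;  // field name -> value

        SourceRecord(String source, LocalDate lastModified, Map<String, String> fields) {
            this.source = source;
            this.lastModified = lastModified;
            this.fields = fields;
        }
    }

    // field name -> (source name -> weight); anything unlisted gets weight 0
    private final Map<String, Map<String, Integer>> fieldWeights;

    WeightedMerger(Map<String, Map<String, Integer>> fieldWeights) {
        this.fieldWeights = fieldWeights;
    }

    /** Picks, per field, the non-empty value from the best-ranked record. */
    Map<String, String> merge(List<SourceRecord> cluster) {
        Map<String, String> best = new HashMap<>();
        Map<String, SourceRecord> winner = new HashMap<>();
        for (SourceRecord rec : cluster) {
            for (Map.Entry<String, String> e : rec.fields.entrySet()) {
                String value = e.getValue();
                if (value == null || value.isEmpty()) {
                    continue; // an empty value never overrides a real one
                }
                SourceRecord current = winner.get(e.getKey());
                if (current == null || beats(e.getKey(), rec, current)) {
                    winner.put(e.getKey(), rec);
                    best.put(e.getKey(), value);
                }
            }
        }
        return best;
    }

    // Higher source weight wins; on equal weight the more recently modified row wins.
    private boolean beats(String field, SourceRecord a, SourceRecord b) {
        int wa = weight(field, a.source);
        int wb = weight(field, b.source);
        if (wa != wb) {
            return wa > wb;
        }
        return a.lastModified.isAfter(b.lastModified);
    }

    private int weight(String field, String source) {
        Map<String, Integer> perSource = fieldWeights.get(field);
        if (perSource == null) {
            return 0;
        }
        Integer w = perSource.get(source);
        return w == null ? 0 : w;
    }
}
```

Treating age only as a tie-breaker is one possible choice; a real merger might instead combine weight and age into a single score.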

Original issue: http://code.google.com/p/duke/issues/detail?id=120

larsga avatar Feb 15 '14 09:02 larsga

From [email protected] on June 12, 2013 05:52:57

Yes, I've been wanting to develop something along these lines for a while now. Duke contains utilities to create clusters containing all matching records, but for now the code stops there.

I guess what you want is to automatically produce a "gold standard" record for each cluster.

Yes, a weight for data source, age of record, and other measures can be used for this.

My problem is that I have a limited amount of source data to play with to develop this. Do you have some example data that you could share?

Status: Accepted
Owner: [email protected]
Labels: -Type-Defect Type-Enhancement

larsga avatar Feb 15 '14 09:02 larsga

From vasilievip on June 12, 2013 06:09:34

> I guess what you want is to automatically produce a "gold standard" record for each cluster.

Yes, that sounds like a good thing to get out of deduplication.

> Do you have some example data that you could share?

I'll work on getting a few samples, but I'm not sure they will be any better than the "limited amount of source data" you already have. There is unit test data in mosaic which could be more useful: https://svn.java.net/svn/mosaic~mdm/trunk/open-dm-mi/index-core/src/test/resources/

I'm also thinking that the datasource may be the same for every system, e.g. one or many CSV files, with some column in the file acting as a discriminator that identifies the source system. So a Duke datasource may need filters to select the proper records.
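The discriminator idea could look roughly like the following sketch, which assumes rows have already been read into maps of column name to value. `SourceDiscriminator` and `splitBySource` are hypothetical names, not Duke classes, and no claim is made here about how Duke's own datasources would implement such a filter.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: split rows from one shared file by a discriminator column.
final class SourceDiscriminator {

    /**
     * Groups rows by the value of the discriminator column, so that each
     * group can be treated as if it came from its own source system.
     * Rows without a value in that column are grouped under "unknown".
     */
    static Map<String, List<Map<String, String>>> splitBySource(
            List<Map<String, String>> rows, String discriminatorColumn) {
        Map<String, List<Map<String, String>>> bySystem = new LinkedHashMap<>();
        for (Map<String, String> row : rows) {
            String system = row.get(discriminatorColumn);
            if (system == null || system.isEmpty()) {
                system = "unknown";
            }
            bySystem.computeIfAbsent(system, k -> new ArrayList<>()).add(row);
        }
        return bySystem;
    }
}
```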

larsga avatar Feb 15 '14 09:02 larsga

From [email protected] on June 12, 2013 06:21:15

Yes, having a field with an ID for the data source will make a big difference. It's definitely possible right now (as I use it in some applications).

I was thinking of having real data, if possible, so that I can judge the effectiveness of the various possible approaches. For example, I came up with an idea of using clustering techniques to pick the best values for each property, based on distance calculations between the different values. Knowing how well this works, and how to combine it with the other alternatives, is essentially impossible without being able to experiment with real data.
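One possible reading of the distance-based idea, sketched under assumptions: for a single property, pick the value with the smallest total edit distance to all the other values in the cluster (a medoid). The comment above does not say which clustering technique or distance measure was intended, so both the medoid choice and the use of Levenshtein distance are assumptions made for illustration, and `MedoidPicker` is a hypothetical name.

```java
import java.util.List;

// Hypothetical sketch: choose the most "central" value of one property in a cluster.
final class MedoidPicker {

    /** Classic Levenshtein edit distance, computed with two rolling rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Returns the value whose summed distance to all values is smallest.
     * Ties go to the earliest value in the list; an empty list yields null.
     */
    static String pick(List<String> values) {
        String best = null;
        int bestTotal = Integer.MAX_VALUE;
        for (String candidate : values) {
            int total = 0;
            for (String other : values) {
                total += levenshtein(candidate, other);
            }
            if (total < bestTotal) {
                bestTotal = total;
                best = candidate;
            }
        }
        return best;
    }
}
```

Whether this kind of selection beats a simple source-weight rule is exactly the question that needs real data to answer.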

Anyway, I could make an attempt based on the two real data sets I have right now, but the result would definitely be better if I could get hold of one or two more.

larsga avatar Feb 15 '14 09:02 larsga

From vasilievip on June 12, 2013 06:34:42

In the mosaic demos, data is merged by picking fields according to datasource weight and then letting the user adjust the selection if needed. One way to improve this is to learn from users and adjust the merger based on what they selected: if some combination of source systems and fields contributed to the best record, take this as a pattern (field 1 from system 2, field X from system Y) and apply it to further merges. Fully automated merging seems tricky to implement properly.
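A rough sketch of the "learn from users" idea, assuming the only signal recorded is which source system the user chose for each field. `SelectionLearner` and its methods are hypothetical names; this is a simple counting scheme for illustration, not how mosaic or Duke does it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: remember which source users pick per field and prefer it later.
final class SelectionLearner {

    // field name -> (source name -> number of times a user chose that source)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    /** Call whenever a user confirms or overrides the value for a field. */
    void recordChoice(String field, String source) {
        counts.computeIfAbsent(field, k -> new HashMap<>()).merge(source, 1, Integer::sum);
    }

    /**
     * Returns the source most often chosen for this field, or null if no
     * choice has been recorded yet. Ties are resolved arbitrarily.
     */
    String preferredSource(String field) {
        Map<String, Integer> perSource = counts.get(field);
        if (perSource == null) {
            return null;
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : perSource.entrySet()) {
            if (e.getValue() > bestCount) {
                bestCount = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }
}
```

The counts could also be fed back as per-field source weights, so the automatic merge gradually follows what users actually select.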

larsga avatar Feb 15 '14 09:02 larsga

From vasilievip on August 12, 2013 00:27:36

Sample datasets: https://github.com/open-city/dedupe/tree/master/test/datasets

larsga avatar Feb 15 '14 09:02 larsga

+1 for automated merging. That would be brilliant.

swamikevala avatar Sep 19 '14 08:09 swamikevala

Yes, I'd really like to add this feature, but I need real data to work with, in order to develop some feeling for what strategies work and what strategies don't. If someone can share real data with me that I can try this on that would help a lot. I'm happy to sign NDAs if necessary.

larsga avatar Sep 19 '14 11:09 larsga