
Option to skip/bypass training pairs from the same source?

Open lsbilbro opened this issue 2 years ago • 5 comments

Hi team!

When deduplicating more than 1 data source, have you considered a config to tell zingg to ignore/skip pairs where both records originate from the same source?

In my framework, I allow this setting per source. The idea is that sometimes you might want to only consider pairs across data sources. Perhaps one data source has been mastered and dedup'ed previously?
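To sketch what I have in mind (purely hypothetical, none of these keys exist in zingg today), a per-source setting could look something like this:

```python
# Purely hypothetical illustration of the requested setting, expressed as a
# Python dict. "skipSelfPairs" is an invented name, not an existing zingg key.
proposed_sources = [
    {"name": "mastered_customers", "skipSelfPairs": True},   # already deduped upstream
    {"name": "web_signups",        "skipSelfPairs": False},  # may contain internal dupes
    {"name": "partner_feed",       "skipSelfPairs": False},
]
```

Pairs where both records come from a source marked `skipSelfPairs` would then never be generated or shown for labeling.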

(if you already have this feature, my apologies!)

lsbilbro avatar Feb 25 '22 15:02 lsbilbro

That is a great idea @lsbilbro - in fact another user suggested this in #54, but at that time we felt that the current implementation was getting good training samples for the cases that don't match. We can definitely have this controlled through a flag - or maybe a new phase? What is your view there?

sonalgoyal avatar Feb 25 '22 16:02 sonalgoyal

I think a source-scoped config and a corresponding update to the findTrainingData phase make sense to me. I suppose you could have an alternate phase like findTrainingDataAcrossSources... but that seems less ideal than a config.

I don't know how much it will really matter to zingg, but let me explain why it's such a key step in my framework. My basic workflow is:

  1. generate all candidate pairs
  2. score all candidate pairs
  3. human-label a small subset and train model to label the rest
  4. connectedComponents to get transitive closure of the edges

If I know ahead of time that there should be no pairs where both records are in the same source, then I can (sometimes drastically) reduce the pairs generated in step (1)... making all the follow up steps even faster.
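For concreteness, step (1) with that restriction looks roughly like this in my PySpark pipeline (column names such as `block_key` and `source` are placeholders from my own prep step, not zingg fields):

```python
# Rough sketch of step (1): candidate pair generation, keeping only
# cross-source pairs. id/source/block_key are placeholder columns.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
records = spark.read.parquet("prepped_records")  # id, source, block_key, ...

a, b = records.alias("a"), records.alias("b")
candidate_pairs = (
    a.join(b, F.col("a.block_key") == F.col("b.block_key"))
     .where(F.col("a.id") < F.col("b.id"))            # emit each pair only once
     .where(F.col("a.source") != F.col("b.source"))   # drop same-source pairs
)
```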

This actually becomes even more important when I'm doing a small incremental deduplication, e.g. maybe I have 100 new records and I want to match them against a source of 20 million previously-deduped records.

I really do not want to waste time generating candidate pairs within that 20 million record dataset... they are already mastered. Any time spent here is wasted.

I only want to spend my time and attention on the tiny subset of candidate pairs that somehow include the new 100 records.
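Back-of-the-envelope, the difference is enormous:

```python
# Rough pair counts for the example above (100 new records vs 20M mastered).
n_mastered, n_new = 20_000_000, 100

within_mastered = n_mastered * (n_mastered - 1) // 2  # ~2e14 pairs, all wasted effort
new_vs_mastered = n_new * n_mastered                  # 2e9 pairs, before any blocking

print(f"{within_mastered:.1e} wasted vs {new_vs_mastered:.1e} useful comparisons")
```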

lsbilbro avatar Feb 25 '22 17:02 lsbilbro

In this case, how do you master the existing 20m? Isn't there a model that you built for them that you would reuse in the incremental run? Or are you saying that the sources, or at least one of them, are already deduped, so it doesn't make sense to look within them?

sonalgoyal avatar Feb 25 '22 18:02 sonalgoyal

yea, let me clarify.

The common scenario would have two high-level stages.

Stage 1: Initial batch deduplication of data sources

  a. ingest/prep raw data (say... 30 million records across 3 sources)
  b. generate candidate pairs across/within all sources
  c. score candidate pairs
  d. human-label a subset and train a model to label the rest
  e. connectedComponents to get clusters
  f. generate "golden records" by aggregating on the clusterIds (from step 1e) -> this generates, say, 20 million mastered entities

Stage 2: Incremental deduplication of new data

  a. ingest/prep new raw data (say... 100-10,000 records)
  b. generate candidate pairs but be sure to only look at pairs that include new records
  c. score candidates
  d. use existing model to label
  e. connectedComponents (only if needed... might not be needed at this step)

The real value of filtering the candidate pairs comes in step 2b. As you hinted, we are not building a model in stage 2; we are reusing the one from stage 1. If the incremental data is small enough (say on the order of 100 records), I could simply compare each new record to my 20 million target records and use the existing model to classify them as match/no-match... but if the incremental data set gets larger (say 1,000-10,000+), we may need a blocking/binning strategy here, too, to reduce the work. And it is at this step that we definitely don't want to waste time comparing mastered records to other mastered records.
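A rough PySpark illustration of 2b (again with placeholder column names, and not how zingg does it internally): join the small incremental batch against the mastered set on a blocking key, so mastered-vs-mastered pairs are never produced.

```python
# Sketch of stage 2b: every candidate pair includes at least one new record,
# because the mastered set is never self-joined.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
mastered = spark.read.parquet("mastered_entities")   # ~20M golden records
incoming = spark.read.parquet("incremental_batch")   # 100-10,000 new records

candidate_pairs = (
    F.broadcast(incoming).alias("new")                # small side, broadcast it
     .join(mastered.alias("m"),
           F.col("new.block_key") == F.col("m.block_key"))
)
# new-vs-new pairs, if the batch itself can contain dupes, would be a small
# additional self-join on incoming; then score everything with the stage-1 model
```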

I suspect that the zingg workflow may be sufficiently different from what I've outlined above... so it could be that this isn't actually relevant here.

lsbilbro avatar Feb 25 '22 19:02 lsbilbro

Thanks @lsbilbro, this detailed explanation is super helpful. In the case of Zingg, the workflow looks like this:

Stage 1: Initial batch deduplication of data sources

  a. ingest/prep raw data (say... 30 million records across 3 sources)
  b. findTrainingData/label phases to mark samples
  c. train phase, which builds the blocking tree and the classifier and persists them
  d. match phase, which does the entire deduplication (block, classify, connected components), eventually yielding the z_clusters

Stage 2: Incremental deduplication of new data

  a. ingest/prep new raw data (say... 100-10,000 records)
  b. link phase, which reuses the above models. It blocks, classifies, and builds the clusters in a source-aware way, eventually yielding the z_clusters

The link phase makes sure that you only compare across the sources and not within a source. (https://docs.zingg.ai/zingg/stepbystep/match#link)
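For reference, the two stages map onto zingg's run script roughly like this (script path and config file are placeholders for your setup, and findTrainingData/label are typically repeated a few times before train):

```python
# Rough sketch of driving the two stages through zingg's CLI phases.
import subprocess

ZINGG = "./scripts/zingg.sh"   # placeholder path to the zingg run script
CONF = "config.json"           # placeholder config

# Stage 1: build and apply the model on the full batch
for phase in ["findTrainingData", "label", "train", "match"]:
    subprocess.run([ZINGG, "--phase", phase, "--conf", CONF], check=True)

# Stage 2: point the config at the new data and the mastered set as two
# sources and run link, which only compares across sources
subprocess.run([ZINGG, "--phase", "link", "--conf", CONF], check=True)
```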

Does this satisfy your requirements?

sonalgoyal avatar Feb 26 '22 11:02 sonalgoyal