Continuous operation use case
- Once a Zingg model is trained, I think I can safely assume that the computational complexity (in big-O) is the same whether doing linking or further de-duplicating, yes?
- Even with the complexity being the same, which mode of operation do you consider more performant — fuzzy matching for deduplication, or linking — given the underlying machinery of search/match with blocking and classification with feature vectors?
- This assumes that Zingg is searching/matching only one or a few records against a trained model. Does that make sense?
- This leads me to . . .
- After mastering data into the "Master Database" with Zingg, is there also a practical and viable use case for Zingg where we merely want to match a singular record (likely noisy, and almost certainly missing some field values) with the Master Database? The problems that I see with that are:
- Zingg seems to be meant for mastering in bulk (en masse) with at least two large datasets (files or databases), rather than matching a single record against one large dataset (the Master Database). For instance, is there a more convenient way of supplying that single record — e.g. as JSON, via streaming, or as a run-time argument — as opposed to writing it to a file?
- For this use case it would be desirable for Zingg to run continuously such that it attaches to the "Master Database" once, for either case:
- it makes a connection once at startup, then issues queries in an (infinite) sequence,
- or, if a big Parquet/CSV file serves as the reference, it reads that file only once, and then reads the novel one- or two-record arguments as they arrive.
- If I'm not mistaken, as of now we have to invoke Zingg as a Spark job for each single record as it arrives, so the "big reference file" is opened and read each time, along with re-initializing everything else,
- as opposed to invoking Zingg once and continuously executing on each record in turn.
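The pattern being asked for above could be sketched like this: load the reference ("master") dataset once at startup, then serve match requests against it in a loop, rather than relaunching a batch job per record. All names here are illustrative stand-ins, not Zingg's actual API; a real deployment would read the master via Spark and score with the trained model.

```python
# Sketch of a long-running match service, assuming a hypothetical setup:
# the master is loaded once and each incoming record is scored against it.
import json

def load_master():
    # In practice this would be e.g. spark.read.parquet(...), done once;
    # here a tiny in-memory stand-in.
    return [{"id": 1, "name": "Acme Corp"}, {"id": 2, "name": "Globex"}]

def best_match(record, master):
    # Stand-in Jaccard scorer on name tokens; a real deployment would
    # reuse the trained model's blocking and classification instead.
    def score(a, b):
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / max(len(ta | tb), 1)
    return max(master, key=lambda m: score(record["name"], m["name"]))

master = load_master()                             # read once at startup
incoming = [json.loads('{"name": "ACME corp"}')]   # e.g. from a stream/queue
for rec in incoming:
    matched = best_match(rec, master)              # no re-initialization per record
```

The point of the sketch is only the lifecycle: one expensive load, many cheap lookups.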
- Do you have any recommendations as to best practices for linking records when one (perhaps the only) feature is a geohash or geocode? These are alphanumeric identifiers (strings) that have an intrinsic hierarchical spatial structure. A similarity function could, and should, exploit this hierarchy, unlike Zingg's current built-in similarity functions. Do you have a recommendation for a similarity function in your repertoire, other than our writing a custom one?
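Since geohashes encode location hierarchically — nearby points tend to share long common prefixes — one plausible custom similarity (a sketch, not a Zingg built-in) is the shared-prefix length normalized by the longer string:

```python
# Hypothetical prefix-based similarity for geohashes: score in [0, 1],
# where a longer common prefix means the two cells are (usually) closer.
def geohash_similarity(a: str, b: str) -> float:
    if not a or not b:
        return 0.0
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n / max(len(a), len(b))
```

One known caveat with this approach: two points straddling a geohash cell boundary can be physically close yet share a short prefix, so a production function might also compare neighboring cells or fall back to haversine distance on decoded coordinates.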
- Linking is more performant than matching, as you match against a master list: graph computations are not done, and all links are simply rolled up to the master.
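To illustrate the distinction above with a toy sketch (not Zingg's implementation): deduplication must resolve transitive matches (A~B and B~C imply {A, B, C}) via graph machinery such as union-find, while linking only assigns each incoming record the id of its best master match — no graph needed.

```python
# Dedup: matched pairs must be merged into connected components (union-find).
def dedup_clusters(pairs, records):
    parent = {r: r for r in records}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for a, b in pairs:
        parent[find(a)] = find(b)
    clusters = {}
    for r in records:
        clusters.setdefault(find(r), []).append(r)
    return list(clusters.values())

# Linking: each record simply rolls up to its matched master id.
def link_to_master(links):
    return dict(links)  # (record, master_id) pairs; no transitive closure
```

The dedup side does strictly more work, which is consistent with linking being the cheaper mode.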
- Zingg can be enhanced to support the incremental use case. For that, we need to understand the deployment patterns a bit more — where is the master saved? What happens to the updates, etc.?
- If you can explain the geocode/geohash details more, we are happy to provide the custom functions.
@rljonesiii any comments here? Need further help?