Continuous operation use case
- Once a Zingg model is trained, I think I can safely assume that the computational complexity (in big-O) is the same whether doing linking or further de-duplicating, yes?
- Even with the complexity being the same, which mode of operation do you consider more performant — fuzzy matching for deduplication, or linking — given the underlying machinery of search/match with blocking and classification with feature vectors?
- This assumes that Zingg is searching/matching only one or a few records against a trained model. Does that make sense?
- This leads me to . . .
- After mastering data into the "Master Database" with Zingg, is there also a practical and viable use case for Zingg where we merely want to match a singular record (likely noisy, and almost certainly missing some field values) with the Master Database? The problems that I see with that are:
- Zingg seems to be meant for mastering in bulk (en masse) with at least two large datasets (files or databases), rather than matching a single record against one large dataset (the Master Database). For instance, is there a more convenient way of supplying that single record — e.g. as JSON, via streaming, or as a run-time argument — as opposed to writing it to a file?
- For this use case it would be desirable for Zingg to run continuously such that it attaches to the "Master Database" once, for either case:
- it makes a connection once at startup, then issues queries in an (infinite) sequence,
- or, if a big Parquet/CSV file serves as the reference, it reads that file only once, and then reads the novel one- or two-record arguments as they arrive.
- If I'm not mistaken, as of now we have to invoke Zingg as a Spark job for each single record as it arrives, so the "big reference file" is opened and read each time, along with re-initializing everything else,
- as opposed to invoking Zingg once and continuously executing on each record in turn.
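The pattern being asked for above could be sketched like this: load the reference ("master") dataset once at startup, then serve match requests against it in a loop, rather than relaunching a batch job per record. All names here are illustrative stand-ins, not Zingg's actual API; a real deployment would read the master via Spark and score with the trained model.

```python
# Sketch of a long-running match service, assuming a hypothetical setup:
# the master is loaded once and each incoming record is scored against it.
import json

def load_master():
    # In practice this would be e.g. spark.read.parquet(...), done once;
    # here a tiny in-memory stand-in.
    return [{"id": 1, "name": "Acme Corp"}, {"id": 2, "name": "Globex"}]

def best_match(record, master):
    # Stand-in Jaccard scorer on name tokens; a real deployment would
    # reuse the trained model's blocking and classification instead.
    def score(a, b):
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / max(len(ta | tb), 1)
    return max(master, key=lambda m: score(record["name"], m["name"]))

master = load_master()                             # read once at startup
incoming = [json.loads('{"name": "ACME corp"}')]   # e.g. from a stream/queue
for rec in incoming:
    matched = best_match(rec, master)              # no re-initialization per record
```

The point of the sketch is only the lifecycle: one expensive load, many cheap lookups.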
- Do you have any recommendations as to best practices for linking records when one (perhaps the only) feature is a geohash or geocode? These are alphanumeric identifiers (strings) that have an intrinsic hierarchical spatial structure. A similarity function could, and should, exploit this hierarchy, unlike Zingg's current built-in similarity functions. Do you have a recommendation for a similarity function in your repertoire, other than our writing a custom one?
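Since geohashes encode location hierarchically — nearby points tend to share long common prefixes — one plausible custom similarity (a sketch, not a Zingg built-in) is the shared-prefix length normalized by the longer string:

```python
# Hypothetical prefix-based similarity for geohashes: score in [0, 1],
# where a longer common prefix means the two cells are (usually) closer.
def geohash_similarity(a: str, b: str) -> float:
    if not a or not b:
        return 0.0
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n / max(len(a), len(b))
```

One known caveat with this approach: two points straddling a geohash cell boundary can be physically close yet share a short prefix, so a production function might also compare neighboring cells or fall back to haversine distance on decoded coordinates.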
- Linking is more performant than matching, as you match against a master list: graph computations are not done, and all links are simply rolled up to the master.
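To illustrate the distinction above with a toy sketch (not Zingg's implementation): deduplication must resolve transitive matches (A~B and B~C imply {A, B, C}) via graph machinery such as union-find, while linking only assigns each incoming record the id of its best master match — no graph needed.

```python
# Dedup: matched pairs must be merged into connected components (union-find).
def dedup_clusters(pairs, records):
    parent = {r: r for r in records}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for a, b in pairs:
        parent[find(a)] = find(b)
    clusters = {}
    for r in records:
        clusters.setdefault(find(r), []).append(r)
    return list(clusters.values())

# Linking: each record simply rolls up to its matched master id.
def link_to_master(links):
    return dict(links)  # (record, master_id) pairs; no transitive closure
```

The dedup side does strictly more work, which is consistent with linking being the cheaper mode.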
- Zingg can be enhanced to support the incremental use case. For that, we need to understand the deployment patterns a bit more — where is the master saved? What happens to the updates, etc.?
- If you can explain the geocode/geohash details more, we are happy to provide the custom functions.
@rljonesiii any comments here? Need further help?