
SQL based blocking and distance functions

Open sonalgoyal opened this issue 2 years ago • 4 comments

What if we could take SQL from, say, a dbt model or elsewhere and use it for our model training, for blocking as well as similarity? Then non-Java programmers could also code and customize Zingg without bothering about the internals.

sonalgoyal avatar Oct 16 '21 16:10 sonalgoyal

Let us discuss this @navinrathore

sonalgoyal avatar Dec 28 '21 03:12 sonalgoyal

We can have two files, blockingFunctions.txt and similarityFunctions.txt. Each contains multiple functions written as SQL.

The blocking file has the name of the function, followed by the input column type, the output column type, and the SQL. The FROM clause uses the same name as the name of the data in the config. For example:

NameFn: string, string Select name from test;
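To make the proposed entry format concrete, here is a minimal parsing sketch in Python. It assumes each entry is `<name>: <input type>, <output type> <SQL>;` as in the `NameFn` example, and that the SQL itself contains no semicolon. `SqlFunction` and `parse_functions` are hypothetical names used only for illustration; they are not part of Zingg.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SqlFunction:
    """One parsed entry (hypothetical structure, not a Zingg class)."""
    name: str
    input_type: str
    output_type: str
    sql: str


# Assumed grammar: "<name>: <input type>, <output type> <SQL>;"
_ENTRY = re.compile(
    r"(?P<name>\w+)\s*:\s*(?P<input>\w+)\s*,\s*(?P<output>\w+)\s+(?P<sql>[^;]+);"
)


def parse_functions(text: str) -> List[SqlFunction]:
    """Parse every entry found in the text of a functions file."""
    return [
        SqlFunction(
            name=m["name"],
            input_type=m["input"],
            output_type=m["output"],
            # Collapse internal whitespace so multi-line SQL stays on one line.
            sql=" ".join(m["sql"].split()),
        )
        for m in _ENTRY.finditer(text)
    ]
```

Calling `parse_functions("NameFn: string, string Select name from test;")` yields a single `SqlFunction` whose `sql` is `Select name from test`.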

Other functions are defined similarly.

similarity functions always return double but

sonalgoyal avatar Dec 28 '21 09:12 sonalgoyal

Take two flags, --blockingFunctions and --similarityFunctions, each with the location of its file.
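A rough sketch of how the two flags might be accepted, using Python's `argparse` as a stand-in. This is not Zingg's `ClientOptions` code; the `dest` names and the `zingg-sketch` program name are assumptions made for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the two proposed flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="zingg-sketch")
    parser.add_argument(
        "--blockingFunctions",
        dest="blocking_functions",
        default=None,
        help="location of the file with SQL blocking functions",
    )
    parser.add_argument(
        "--similarityFunctions",
        dest="similarity_functions",
        default=None,
        help="location of the file with SQL similarity functions",
    )
    return parser
```

Both flags are optional here, so a run that passes neither would fall back to whatever built-in functions are already in use.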

sonalgoyal avatar Dec 29 '21 07:12 sonalgoyal

  • all work to be done in the new branch.
  • take flags --blockingFunctions, --similarityFunctions in ClientOptions and Arguments and Client
  • parse and set the right value in above
  • define the yml for blocking functions, also take care of linking etc (Sonal)
  • define blocking function interface (Sonal)
  • figure out which lib etc to use for parsing the yml, possibly Jackson
  • read the yml and build the blocking functions as per the interface
  • register the hash functions in the registry
  • build blocking tree (Sonal)
  • test (Sonal/Navin)
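The interface and registry steps in the list above could look roughly like the following Python sketch. Everything here is an assumption for illustration: `BlockingFunction`, `Registry`, and `sql_blocking_function` are hypothetical names, and an in-memory SQLite table stands in for whatever SQL engine Zingg would actually use. The table is named after the FROM clause, matching the idea that it shares the name of the data in the config.

```python
import sqlite3
from typing import Callable, Dict

# Hypothetical interface: a blocking function maps one record to a block key.
BlockingFunction = Callable[[dict], object]


class Registry:
    """Name-to-function lookup (illustrative, not Zingg's registry)."""

    def __init__(self) -> None:
        self._functions: Dict[str, BlockingFunction] = {}

    def register(self, name: str, fn: BlockingFunction) -> None:
        if name in self._functions:
            raise ValueError(f"duplicate function name: {name}")
        self._functions[name] = fn

    def get(self, name: str) -> BlockingFunction:
        return self._functions[name]


def sql_blocking_function(sql: str, table: str) -> BlockingFunction:
    """Wrap a SQL statement as a function over a single record."""

    def apply(record: dict) -> object:
        # Load the record into a one-row table, then run the user's SQL on it.
        conn = sqlite3.connect(":memory:")
        try:
            columns = ", ".join(f'"{c}"' for c in record)
            placeholders = ", ".join("?" for _ in record)
            conn.execute(f'CREATE TABLE "{table}" ({columns})')
            conn.execute(
                f'INSERT INTO "{table}" VALUES ({placeholders})',
                list(record.values()),
            )
            return conn.execute(sql).fetchone()[0]
        finally:
            conn.close()

    return apply
```

For example, registering `sql_blocking_function("Select name from test", "test")` under `NameFn` and applying it to `{"name": "Ann", "city": "Pune"}` returns the value of the `name` column. Opening a connection per record keeps the sketch short; a real implementation would evaluate the SQL over the whole dataset at once.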

sonalgoyal avatar Dec 29 '21 07:12 sonalgoyal