zingg
SQL based blocking and distance functions
What if we could take SQL, say from a dbt model or elsewhere, and use it for model training, for both blocking and similarity? Non-Java programmers could then code and customize Zingg without dealing with the internals.
Let us discuss this @navinrathore
We can have two files, blockingFunctions.txt and similarityFunctions.txt. Each contains multiple functions written as SQL.
The blocking file lists, for each function, its name followed by the input column type, the output column type, and the SQL. The FROM clause uses the same name as the data in the config. For example: NameFn: string, string Select name from test;
Other functions are defined similarly.
similarity functions always return double but
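The entry format described above could be parsed roughly as follows. This is only a sketch under assumptions: `SqlFunctionSpec` and `SqlFunctionParser` are hypothetical names, not existing Zingg classes, and the grammar (name, colon, two comma-separated types, then SQL ending in a semicolon) is inferred from the single `NameFn` example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical holder for one parsed entry; not part of Zingg.
final class SqlFunctionSpec {
    final String name;
    final String inputType;
    final String outputType;
    final String sql;

    SqlFunctionSpec(String name, String inputType, String outputType, String sql) {
        this.name = name;
        this.inputType = inputType;
        this.outputType = outputType;
        this.sql = sql;
    }
}

// Hypothetical parser for entries such as:
//   NameFn: string, string Select name from test;
final class SqlFunctionParser {
    // Assumed grammar: name ':' inputType ',' outputType, whitespace, SQL up to the final ';'
    private static final Pattern ENTRY = Pattern.compile(
            "^\\s*(\\w+)\\s*:\\s*(\\w+)\\s*,\\s*(\\w+)\\s+(.+?);\\s*$",
            Pattern.DOTALL);

    static SqlFunctionSpec parse(String entry) {
        Matcher m = ENTRY.matcher(entry);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized function entry: " + entry);
        }
        // The trailing ';' is dropped from the stored SQL text.
        return new SqlFunctionSpec(m.group(1), m.group(2), m.group(3), m.group(4).trim());
    }
}
```

Calling `SqlFunctionParser.parse("NameFn: string, string Select name from test;")` yields a spec whose `sql` field is `Select name from test`. A regular expression keeps the sketch short; a real implementation would probably need a proper grammar once the SQL itself can contain semicolons.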
Two new flags, --blockingFunctions and --similarityFunctions, each take the location of the corresponding file.
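As a rough illustration of how the two flags might be picked out of the command line, here is a standalone sketch. `FunctionFileOptions` is a hypothetical class and does not reflect how Zingg's ClientOptions actually handles arguments; only the flag names come from this issue.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; Zingg's real ClientOptions may work differently.
final class FunctionFileOptions {
    static final String BLOCKING = "--blockingFunctions";
    static final String SIMILARITY = "--similarityFunctions";

    // Returns a map from flag to file location for the two flags above.
    // Any other arguments are ignored by this sketch.
    static Map<String, String> parse(String[] args) {
        Map<String, String> found = new HashMap<>();
        for (int i = 0; i < args.length; i++) {
            if (BLOCKING.equals(args[i]) || SIMILARITY.equals(args[i])) {
                if (i + 1 >= args.length) {
                    throw new IllegalArgumentException(
                            "Missing file location after " + args[i]);
                }
                // Consume the next argument as the file location.
                found.put(args[i], args[++i]);
            }
        }
        return found;
    }
}
```

Both flags are treated as optional here, so a run without them would fall back to whatever the existing behaviour is.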
- all work to be done in the new branch.
- take flags --blockingFunctions, --similarityFunctions in ClientOptions and Arguments and Client
- parse and set the right value in above
- define the yml for blocking functions, also take care of linking etc (Sonal)
- define blocking function interface (Sonal)
- figure out which lib etc to use for parsing the yml, possibly Jackson
- read the yml and build the blocking functions as per the interface
- register the hash functions in the registry
- build blocking tree (Sonal)
- test (Sonal/Navin)
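To make the interface and registry items in the list above more concrete, one possible shape is sketched below. Everything here is an assumption for discussion: `BlockingFunction`, `BlockingFunctionRegistry` and `FirstChars` are hypothetical names, and the in-memory `FirstChars` stands in for a function that would really be backed by the user's SQL.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical interface: a named function from a column value to a blocking key.
interface BlockingFunction<T, R> extends Function<T, R> {
    String name();
}

// Hypothetical registry keyed by function name.
final class BlockingFunctionRegistry {
    private final Map<String, BlockingFunction<?, ?>> byName = new HashMap<>();

    void register(BlockingFunction<?, ?> fn) {
        // Reject duplicates so two definitions cannot silently shadow each other.
        if (byName.putIfAbsent(fn.name(), fn) != null) {
            throw new IllegalStateException("Duplicate blocking function: " + fn.name());
        }
    }

    BlockingFunction<?, ?> lookup(String name) {
        BlockingFunction<?, ?> fn = byName.get(name);
        if (fn == null) {
            throw new IllegalArgumentException("Unknown blocking function: " + name);
        }
        return fn;
    }
}

// Illustrative stand-in: keeps the first n characters of a string as the key.
final class FirstChars implements BlockingFunction<String, String> {
    private final int n;

    FirstChars(int n) {
        this.n = n;
    }

    @Override
    public String name() {
        return "first" + n;
    }

    @Override
    public String apply(String value) {
        return value.substring(0, Math.min(n, value.length()));
    }
}
```

With this shape, `new FirstChars(3).apply("Zingg")` gives `Zin`, and registering the same name twice fails fast. Whether the registry should be keyed by name alone or by name plus input type is left open.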