zingg
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
Right now the code has ColName.COL_PREFIX all over. We should see what's needed and then improve the code.
The current Matcher contains the Graph scoring and other graph logic, which makes them tightly coupled. We should move the scoring to a different class. Also think through other graph stuff...
Blocking algorithms are currently heavily dependent on field order, giving vastly different results when the field order in fieldDefinitions is changed. We should make them more consistent.
May have an impact on enterprise also.
We add the dataframe to the pipe when we read it, which modifies the original args object. In a way that is ok as we are only enriching the args....
(C) 2021 Zingg.AI -> change the year; have one header for analytics and zingg.
Current ZFrame has methods like drop(String, String..) which can be replaced with drop(String..)
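A minimal sketch of the idea, not Zingg's actual ZFrame: the class `SimpleFrame` and its `columns()` accessor are hypothetical and exist only to show that a single `drop(String...)` overload covers every call that `drop(String, String...)` handles. One difference worth noting is that the varargs-only form also accepts zero arguments, so the implementation has to decide what that means (here it returns an unchanged copy).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a frame abstraction; NOT Zingg's actual ZFrame.
final class SimpleFrame {
    private final List<String> columns;

    SimpleFrame(String... columns) {
        this.columns = new ArrayList<>(Arrays.asList(columns));
    }

    // Single varargs overload: handles drop("a") and drop("a", "b", ...).
    // Unlike drop(String, String...), it also accepts zero arguments,
    // in which case an unchanged copy of the frame is returned.
    SimpleFrame drop(String... toDrop) {
        List<String> remaining = new ArrayList<>(columns);
        remaining.removeAll(Arrays.asList(toDrop));
        return new SimpleFrame(remaining.toArray(new String[0]));
    }

    // Defensive copy so callers cannot mutate the frame's column list.
    List<String> columns() {
        return new ArrayList<>(columns);
    }
}
```

With this shape, `frame.drop("z_score")` and `frame.drop("z_score", "z_cluster")` both resolve to the same method, so the two-parameter overload is no longer needed.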
Currently we have implemented methods in ZFrame that should actually be in the Row, Column, StructType etc. classes, e.g. getAsString. One thing to remember: StructField is not serializable in Snowpark so...
A preprocessing phase is needed which will convert all data to lower case before the start of any phase. This is especially relevant for stop words and the recommender, as currently those are case...
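As a rough illustration of such a preprocessing step, the sketch below lower-cases every string value of a record before anything else sees it. It assumes a record can be modelled as a plain `Map<String, Object>`; the class name `LowerCasePreprocessor` and the method `apply` are hypothetical and not part of Zingg. `Locale.ROOT` is used so the result does not depend on the JVM's default locale.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; NOT part of Zingg's API.
final class LowerCasePreprocessor {
    private LowerCasePreprocessor() {}

    // Returns a new record with all String values lower-cased.
    // Non-string values (numbers, nulls, etc.) are copied unchanged,
    // and the input record itself is not modified.
    static Map<String, Object> apply(Map<String, Object> record) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : record.entrySet()) {
            Object v = e.getValue();
            out.put(e.getKey(),
                    v instanceof String ? ((String) v).toLowerCase(Locale.ROOT) : v);
        }
        return out;
    }
}
```

Running the same normalisation over both the data and the stop-word list is what would make the two comparable regardless of case.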