pgdedupe
A simple command line interface to the datamade/dedupe library.
At the end of the run, eliminate tables that are not needed for model evaluation/comparison/diagnostics (e.g., map). Blocking tables may be useful for intensive modeling diagnostics, but we will likely...
When assigning final ids, use a user-provided threshold. Better yet, allow the user to pass multiple thresholds, and create either multiple unique_map tables or a longer form unique_map that also...
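One way the longer-form `unique_map` idea could look is sketched below. This is only an illustration under stated assumptions: the input is a list of `(id_a, id_b, score)` pairs, clusters are formed as plain connected components over pairs at or above each threshold (a stand-in, not dedupe's own clustering), and the function names and the `(threshold, record_id, cluster_id)` row layout are hypothetical.

```python
def cluster_at_threshold(record_ids, scored_pairs, threshold):
    """Map each record id to a cluster id using connected components.

    Assumption: a pair links two records when score >= threshold.
    The cluster id is the smallest record id in the component.
    """
    parent = {r: r for r in record_ids}

    def find(x):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                # Keep the smaller id as root so cluster ids are deterministic.
                if root_b < root_a:
                    root_a, root_b = root_b, root_a
                parent[root_b] = root_a

    return {r: find(r) for r in record_ids}


def long_form_unique_map(record_ids, scored_pairs, thresholds):
    """Return rows of (threshold, record_id, cluster_id), one per record per threshold."""
    rows = []
    for threshold in sorted(thresholds):
        mapping = cluster_at_threshold(record_ids, scored_pairs, threshold)
        for record_id in sorted(record_ids):
            rows.append((threshold, record_id, mapping[record_id]))
    return rows


if __name__ == "__main__":
    pairs = [(1, 2, 0.9), (2, 3, 0.6), (3, 4, 0.2)]
    for row in long_form_unique_map([1, 2, 3, 4], pairs, [0.5, 0.8]):
        print(row)
```

A long-form table keeps every threshold in one place, so comparing cluster assignments across thresholds becomes a filter on the threshold column rather than a join across several tables.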
Docs should have an explainer of the outputs, including what is stored in all of the output tables.
Labeled training example pairs should be stored in a table for selection and reuse. Data stored for examples should include:
- Source
- Source ids
- Label
- Label date...
[This issue](https://github.com/dedupeio/dedupe/issues/538) suggests using gazetteer methods.
Documentation: https://dedupe.io/developers/library/en/latest/API-documentation.html#gazetteer-objects
Code: https://github.com/dedupeio/dedupe/blob/master/dedupe/api.py#L985
Hi, Running your example I am receiving several dozen UserWarnings similar to: ...python3.6/site-packages/dedupe/clustering.py:71: UserWarning: A component contained 91851 elements. Components larger than 30000 are re-filtered.... Are there any negative ramifications...
In dedupe's logs, it reports:
```txt
INFO:dedupe.index:Removing stop word 47
INFO:dedupe.index:Removing stop word 9-
INFO:dedupe.index:Removing stop word 25
```
We're using String comparisons for both SSN and DOB; it...
It'd be nice to have better diagnostic outputs printed after running superdeduper: how many exact matches? How many unique identities? How large is the average cluster? Etc.
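A rough sketch of what such a summary might compute is shown below. It assumes the final assignment is available as a mapping from record id to cluster id; the function name `cluster_diagnostics` and the keys of the returned dictionary are hypothetical, and counting exact matches is left out because it depends on how the exact-match step records its results.

```python
from collections import Counter


def cluster_diagnostics(unique_map):
    """Summarize a {record_id: cluster_id} mapping (assumed input shape)."""
    # Number of records assigned to each cluster id.
    sizes = Counter(unique_map.values())
    n_records = len(unique_map)
    n_clusters = len(sizes)
    return {
        "records": n_records,
        "unique_identities": n_clusters,
        "singletons": sum(1 for size in sizes.values() if size == 1),
        "largest_cluster": max(sizes.values(), default=0),
        "mean_cluster_size": n_records / n_clusters if n_clusters else 0.0,
    }


if __name__ == "__main__":
    summary = cluster_diagnostics({1: 1, 2: 1, 3: 1, 4: 4})
    for name, value in summary.items():
        print(f"{name}: {value}")
```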
Existing implementation only uses a few fields. Expand that.