pgdedupe issues

Add table cleanup step

At the end of the run, drop tables that are not needed for model evaluation, comparison, or diagnostics (e.g., map). Blocking tables may be useful for intensive modeling diagnostics, but we will likely...

ecsalomon

Use user-set clustering thresholds

When assigning final ids, use a user-provided threshold. Better yet, allow the user to pass multiple thresholds, and create either multiple unique_map tables or a longer form unique_map that also...

ecsalomon
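
One way to picture the long-form `unique_map` proposed in the issue above is a table with one row per (threshold, record, cluster id). The sketch below is only an illustration under stated assumptions: it links every pair scoring at or above a threshold and takes connected components, which is a simplification and not the clustering dedupe itself performs. The names `long_form_map` and `scored_pairs`, and the example scores, are hypothetical.

```python
def find(parent, x):
    """Return the root of x, compressing the path along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def long_form_map(scored_pairs, thresholds):
    """Build (threshold, record_id, cluster_id) rows for each threshold.

    scored_pairs: iterable of (record_a, record_b, score), score in [0, 1].
    Pairs scoring at or above a threshold are linked; linked records share
    a cluster id, which is the smallest record id in the cluster.
    """
    scored_pairs = list(scored_pairs)
    records = sorted({r for a, b, _ in scored_pairs for r in (a, b)})
    rows = []
    for threshold in sorted(thresholds):
        parent = {r: r for r in records}
        for a, b, score in scored_pairs:
            if score >= threshold:
                root_a, root_b = find(parent, a), find(parent, b)
                if root_a != root_b:
                    # Keep the smaller id as root so cluster ids are stable.
                    parent[max(root_a, root_b)] = min(root_a, root_b)
        rows.extend((threshold, r, find(parent, r)) for r in records)
    return rows


# Made-up pair scores, for illustration only.
pairs = [(1, 2, 0.95), (2, 3, 0.60), (4, 5, 0.80)]
rows = long_form_map(pairs, thresholds=[0.5, 0.9])
```

Storing the threshold as a column keeps a single table, so comparing thresholds becomes a filter on one column rather than a join across several `unique_map` tables.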

Document Results Tables

Docs should have an explainer of the outputs, including what is stored in all of the output tables.

ecsalomon

Store labeled examples in a table

Labeled training example pairs should be stored in a table for selection and reuse. Data stored for examples should include:

- Source
- Source ids
- Label
- Label date...

ecsalomon
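
A rough sketch of what such a table could look like, limited to the four fields named in the issue above. Everything else is an assumption: the table and column names are hypothetical, the pair is stored as two id columns, the labels follow the `match`/`distinct` keys of dedupe's training format, and SQLite stands in for PostgreSQL only so the example runs on its own.

```python
import sqlite3
from datetime import date

# Hypothetical schema; column names are illustrative, not pgdedupe's.
DDL = """
CREATE TABLE labeled_examples (
    source      TEXT NOT NULL,
    source_id_a TEXT NOT NULL,
    source_id_b TEXT NOT NULL,
    label       TEXT NOT NULL CHECK (label IN ('match', 'distinct')),
    label_date  TEXT NOT NULL,
    PRIMARY KEY (source, source_id_a, source_id_b)
)
"""


def store_example(conn, source, id_a, id_b, label, label_date=None):
    """Insert one labeled pair, ordering the ids so (a, b) == (b, a)."""
    id_a, id_b = sorted((str(id_a), str(id_b)))
    label_date = (label_date or date.today()).isoformat()
    conn.execute(
        "INSERT OR REPLACE INTO labeled_examples VALUES (?, ?, ?, ?, ?)",
        (source, id_a, id_b, label, label_date),
    )


def load_examples(conn, source):
    """Return the labeled pairs for one source, grouped by label."""
    examples = {"match": [], "distinct": []}
    query = (
        "SELECT source_id_a, source_id_b, label FROM labeled_examples "
        "WHERE source = ? ORDER BY source_id_a, source_id_b"
    )
    for id_a, id_b, label in conn.execute(query, (source,)):
        examples[label].append((id_a, id_b))
    return examples


# Made-up source name, ids, and dates, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(DDL)
store_example(conn, "entries", 17, 4, "match", date(2018, 1, 15))
store_example(conn, "entries", 4, 9, "distinct", date(2018, 1, 15))
examples = load_examples(conn, "entries")
```

Ordering the two ids before insert means relabeling the same pair overwrites the earlier row instead of creating a mirrored duplicate.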

Implement method for matching new records to existing clusters

[This issue](https://github.com/dedupeio/dedupe/issues/538) suggests using gazetteer methods.

- Documentation: https://dedupe.io/developers/library/en/latest/API-documentation.html#gazetteer-objects
- Code: https://github.com/dedupeio/dedupe/blob/master/dedupe/api.py#L985

ecsalomon

components 're-filtered'

Hi, when running your example I receive several dozen UserWarnings similar to: `...python3.6/site-packages/dedupe/clustering.py:71: UserWarning: A component contained 91851 elements. Components larger than 30000 are re-filtered....` Are there any negative ramifications...

tendres

Add tests!

mbauman

Use a custom comparator for ID numbers and DOBs

In dedupe's logs, it reports:

```txt
INFO:dedupe.index:Removing stop word 47
INFO:dedupe.index:Removing stop word 9-
INFO:dedupe.index:Removing stop word 25
```

We're using String comparisons for both SSN and DOB; it...

mbauman
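
As a possible starting point for the comparator issue above, the sketch below defines two plain distance functions and wires them into field definitions. The assumptions are significant: the field names `ssn` and `dob`, the ISO `YYYY-MM-DD` date strings, and the distance definitions themselves are all hypothetical. The dictionary form with `'type': 'Custom'` and a `'comparator'` key is the one used by older dedupe releases; newer releases use variable objects instead.

```python
def id_number_distance(a, b):
    """Fraction of digit positions that differ between two ID numbers.

    Non-digit characters (dashes, spaces) are ignored. IDs with no digits
    or with different digit counts are treated as maximally distant.
    """
    digits_a = [c for c in a if c.isdigit()]
    digits_b = [c for c in b if c.isdigit()]
    if not digits_a or len(digits_a) != len(digits_b):
        return 1.0
    mismatches = sum(x != y for x, y in zip(digits_a, digits_b))
    return mismatches / len(digits_a)


def dob_distance(a, b):
    """Fraction of year/month/day components that differ.

    Assumes ISO-formatted strings such as '1980-01-02'; anything else is
    treated as maximally distant.
    """
    parts_a, parts_b = a.split("-"), b.split("-")
    if len(parts_a) != 3 or len(parts_b) != 3:
        return 1.0
    return sum(x != y for x, y in zip(parts_a, parts_b)) / 3


# Field definitions in the older dictionary style; field names are assumed.
fields = [
    {"field": "ssn", "type": "Custom", "comparator": id_number_distance},
    {"field": "dob", "type": "Custom", "comparator": dob_distance},
]
```

Comparing position by position treats a single mistyped digit as a small distance, which a generic string comparison on short numeric fields does not guarantee.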

Better output and reporting of found duplicates

It'd be nice to print better diagnostic output after running superdeduper: how many exact matches? How many unique identities? How large is the average cluster? Etc.

mbauman
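
Some of the diagnostics asked for above can be derived from a record-to-cluster mapping alone. This is a minimal sketch under assumptions: `summarize_clusters` and the dictionary keys are hypothetical names, the input is a plain `dict` rather than a database table, and counting exact matches is left out because it needs the raw field values.

```python
from collections import Counter


def summarize_clusters(cluster_of):
    """Summarize a mapping of record id -> cluster id."""
    sizes = Counter(cluster_of.values())
    multi = [n for n in sizes.values() if n > 1]
    return {
        "records": len(cluster_of),
        "unique_identities": len(sizes),
        "clusters_with_duplicates": len(multi),
        # Records beyond the first in each multi-record cluster.
        "duplicate_records": sum(multi) - len(multi),
        "largest_cluster": max(sizes.values(), default=0),
        "mean_cluster_size": len(cluster_of) / len(sizes) if sizes else 0.0,
    }


# Made-up mapping, for illustration only.
cluster_of = {1: 1, 2: 1, 3: 1, 4: 4, 5: 5, 6: 5}
summary = summarize_clusters(cluster_of)
for name, value in summary.items():
    print(f"{name}: {value}")
```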

Use all available information

The existing implementation uses only a few fields. Expand it to use all available fields.

jtwalsh0
