
Smarter automatic deduplication

Open jqnatividad opened this issue 2 years ago • 0 comments

Automatic deduplication works well (#25). However, when duplicates are found and removed, the datastore table and the resource file are no longer in sync.

Smarter dedup can be handled three ways. When dupes are found:

  1. Stop the DP+ job and show the dupe error in the Datastore tab.
  2. Replace the resource file with the dedupped CSV.
  3. Take advantage of qsv dedup's --dupes-output option to create two new resources, RESOURCENAME_dupes.csv and RESOURCENAME_dedupped.csv, which are pushed to the Datastore. The original resource with dupes is NOT pushed. The Data Publisher can then use the CKAN interface to manage which resource to keep (e.g. delete the original and the _dupes resources, then rename the _dedupped resource to remove the _dedupped suffix).
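
A minimal sketch of how option 3 might derive the two resource names and assemble the qsv call. The helper names `dedup_resource_names` and `build_qsv_dedup_cmd` are hypothetical and not part of DP+. The sketch assumes qsv is on the PATH and that the deduplicated rows are written with qsv's common `--output` option. It only builds the argument list; running it and creating the CKAN resources is left out.

```python
from pathlib import Path


def dedup_resource_names(resource_file: str) -> tuple[str, str]:
    """Return (dedupped_name, dupes_name) for a resource file name.

    Hypothetical helper: follows the RESOURCENAME_dedupped.csv /
    RESOURCENAME_dupes.csv naming proposed in this issue.
    """
    path = Path(resource_file)
    suffix = path.suffix or ".csv"  # assumption: default to .csv if no extension
    return f"{path.stem}_dedupped{suffix}", f"{path.stem}_dupes{suffix}"


def build_qsv_dedup_cmd(resource_file: str, qsv_bin: str = "qsv") -> list[str]:
    """Build (but do not run) a qsv dedup command for option 3.

    --dupes-output receives the removed duplicate rows; --output receives
    the deduplicated rows. Both files land in the current directory here,
    which is an assumption made for illustration.
    """
    dedupped, dupes = dedup_resource_names(resource_file)
    return [
        qsv_bin, "dedup", resource_file,
        "--dupes-output", dupes,
        "--output", dedupped,
    ]


if __name__ == "__main__":
    # Print the command that would be passed to e.g. subprocess.run(..., check=True)
    print(" ".join(build_qsv_dedup_cmd("RESOURCENAME.csv")))
```

If the dupes file comes back with no data rows, the job could skip creating the two extra resources and push the original as usual; that check is not shown here.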

jqnatividad, May 03 '22 09:05