datapusher-plus
Smarter automatic deduplication
Automatic deduplication works well (#25). However, when duplicates are found and removed, the datastore table and the resource file are no longer in sync.
Smarter dedup can be handled three ways. When dupes are found:
- Stop the DP+ job and show the dupe error in the Datastore tab.
- Replace the resource file with the dedupped CSV.
- Take advantage of `qsv dedup`'s `--dupes-output` option and create two new resources, RESOURCENAME_dupes.csv and RESOURCENAME_dedupped.csv, which are pushed to the Datastore. The original resource with dupes is NOT pushed. The Data Publisher can then just use the CKAN interface to manage which resource to keep (e.g. delete the original and the _dupes resources; rename the _dedupped resource, removing the _dedupped suffix).
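A rough sketch of how the third option might look, assuming `qsv` is on the PATH and that RESOURCENAME means the file name without its extension. The helper names (`derived_names`, `build_dedup_cmd`, `split_duplicates`) are hypothetical and not part of DP+; creating the two CKAN resources is left out.

```python
import subprocess
from pathlib import Path


def derived_names(resource_name: str) -> tuple[str, str]:
    """Return (dupes, dedupped) file names for a resource.

    Assumption: RESOURCENAME is the file name with its extension stripped.
    """
    stem = Path(resource_name).stem
    return f"{stem}_dupes.csv", f"{stem}_dedupped.csv"


def build_dedup_cmd(input_csv, dupes_csv, dedupped_csv) -> list[str]:
    """Build the qsv command: unique rows go to --output, dupes to --dupes-output."""
    return [
        "qsv", "dedup", str(input_csv),
        "--dupes-output", str(dupes_csv),
        "--output", str(dedupped_csv),
    ]


def split_duplicates(input_csv: Path, workdir: Path) -> tuple[Path, Path]:
    """Run qsv dedup and return the paths of the dupes and dedupped files.

    Hypothetical helper: the caller would then create two CKAN resources
    from these files and skip pushing the original.
    """
    dupes_name, dedupped_name = derived_names(input_csv.name)
    dupes_csv = workdir / dupes_name
    dedupped_csv = workdir / dedupped_name
    # check=True raises CalledProcessError if qsv exits with a non-zero status
    subprocess.run(build_dedup_cmd(input_csv, dupes_csv, dedupped_csv), check=True)
    return dupes_csv, dedupped_csv
```

Keeping the command construction separate from the `subprocess.run` call makes the naming and flags easy to check without having `qsv` installed.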