Remove duplicate documents from Elasticsearch

deduplicate-elasticsearch

A Python script that detects duplicate documents in Elasticsearch. Once the duplicates have been identified, removing them is a matter of calling a delete operation on the redundant document IDs.

For a full description of how this script works, including an analysis of its memory requirements, see: https://alexmarquardt.com/2018/07/23/deduplicating-documents-in-elasticsearch/
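
As a rough illustration of the detection step, the sketch below groups document IDs by a hash of the field values that define a duplicate. It is not the script from this repository: the function name `find_duplicates`, the fields `name` and `email`, and the in-memory `sample_docs` list are all assumptions made for the example. In a real run the documents would come from a scroll or scan over the index instead.

```python
import hashlib
import json
from collections import defaultdict


def find_duplicates(docs, keys):
    """Return lists of document IDs whose values for `keys` are identical.

    `docs` is any iterable of hits shaped like Elasticsearch search results,
    i.e. dicts with an "_id" and a "_source" (assumed shape for this sketch).
    """
    groups = defaultdict(list)
    for doc in docs:
        source = doc["_source"]
        # Serialize the selected values as a JSON list so that field
        # boundaries stay unambiguous ("ab" + "c" differs from "a" + "bc").
        combined = json.dumps([source.get(k) for k in keys], default=str)
        # Store a fixed-size digest rather than the raw values to limit memory.
        digest = hashlib.md5(combined.encode("utf-8")).digest()
        groups[digest].append(doc["_id"])
    # Only groups with more than one ID contain duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical documents standing in for the hits returned by a scroll.
sample_docs = [
    {"_id": "1", "_source": {"name": "Ada", "email": "ada@example.com"}},
    {"_id": "2", "_source": {"name": "Bob", "email": "bob@example.com"}},
    {"_id": "3", "_source": {"name": "Ada", "email": "ada@example.com"}},
]

duplicate_groups = find_duplicates(sample_docs, ["name", "email"])
print(duplicate_groups)  # [['1', '3']]
```

Within each group, one ID can be kept and the remaining IDs passed to a delete operation. Hashing keeps the per-document memory cost constant regardless of field size, at the price of a small theoretical risk of hash collisions.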