
[FEATURE] support action of delete_by_query

Open kkewwei opened this issue 2 years ago • 5 comments

Is your feature request related to a problem? We use ISM to manage index lifecycles, and all documents are written into one index. Some documents expire over time and need to be deleted. The documents are updated by primary key.

What solution would you like? We want to schedule a daily delete_by_query call to delete the expired documents. It would help if ISM supported a delete_by_query action.
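For illustration, a minimal sketch of the kind of request such a daily job would send. It only builds the request and makes no network call. The index name `my-index` and the field name `expire_at` are assumptions, not names from this issue, and the proposed ISM action itself is not shown because it does not exist yet.

```python
import json


def build_delete_expired_request(index, expiry_field="expire_at"):
    """Return (method, path, body) for a _delete_by_query call on expired docs.

    `index` and `expiry_field` are hypothetical names chosen for this sketch.
    """
    body = {
        "query": {
            "range": {
                # Match documents whose expiry timestamp is already in the past.
                expiry_field: {"lt": "now"}
            }
        }
    }
    # conflicts=proceed lets the run continue when a document changes mid-run,
    # which is likely here because documents are updated by primary key.
    path = f"/{index}/_delete_by_query?conflicts=proceed"
    return "POST", path, body


method, path, body = build_delete_expired_request("my-index")
print(method, path)
print(json.dumps(body, indent=2))
```

Whether `conflicts=proceed` is the right choice depends on whether a document updated during the run should survive; that is a judgment call for the workload, not something this sketch decides.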

Do you have any additional context? I can pick up the implementation if accepted.

kkewwei avatar Jul 01 '23 04:07 kkewwei

@bowenlan-amzn hi, can I pick up the implementation?

kkewwei avatar Sep 08 '23 06:09 kkewwei

sure, thanks!

bowenlan-amzn avatar Sep 08 '23 15:09 bowenlan-amzn

@kkewwei please let me know if you run into any blockers with the implementation.

bowenlan-amzn avatar Oct 04 '23 00:10 bowenlan-amzn

Also requested here https://github.com/opensearch-project/index-management/issues/918#issuecomment-1741822381

bowenlan-amzn avatar Nov 07 '23 18:11 bowenlan-amzn

@kkewwei hi! Some comments from my side.

While implementing this feature is technically feasible, it's crucial to exercise caution before employing it in production environments. The deletion process can have a substantial negative impact on cluster performance.

When a document is earmarked for deletion, it doesn't immediately vanish from the index. Instead, it undergoes a multi-step process that can strain the cluster's resources:

  1. Deletion marking: the document is first marked as deleted in its segment. It stays on disk and is only filtered out of search results.
  2. Background merge: the space is reclaimed later, when a background segment merge rewrites the affected segments without the deleted documents, rebuilding their inverted index structures in the process.
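The two steps above can be pictured with a toy model. This is a simplification of how Lucene-based engines generally handle deletes (mark now, reclaim space when segments are merged), not OpenSearch's actual code; the `Segment` class and `merge` function are made up for this sketch.

```python
class Segment:
    """Toy immutable segment: stored documents plus a set of tombstones."""

    def __init__(self, docs):
        self.docs = dict(docs)   # doc_id -> source; never edited after creation
        self.deleted = set()     # ids marked as deleted

    def mark_deleted(self, doc_id):
        # Step 1: only record a tombstone; the stored document is untouched.
        if doc_id in self.docs:
            self.deleted.add(doc_id)

    def live_docs(self):
        # Searches see only documents without a tombstone.
        return {i: d for i, d in self.docs.items() if i not in self.deleted}

    def size_on_disk(self):
        # Deleted documents still occupy space until a merge happens.
        return len(self.docs)


def merge(segments):
    """Step 2: write a new segment holding only the live documents."""
    merged = {}
    for seg in segments:
        merged.update(seg.live_docs())
    return Segment(merged)


seg_a = Segment({1: "a", 2: "b", 3: "c"})
seg_b = Segment({4: "d", 5: "e"})
seg_a.mark_deleted(2)
seg_b.mark_deleted(4)

before = seg_a.size_on_disk() + seg_b.size_on_disk()
merged = merge([seg_a, seg_b])
print(before, merged.size_on_disk())
```

The point of the model is that marking is cheap, while the merge has to rewrite every surviving document in the affected segments. That rewrite is where the CPU and I/O cost of a large delete shows up.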

When a large number of documents are targeted for deletion, the cluster may spend a significant portion of its CPU and I/O resources on processing the delete requests and the merges that follow. This can make indexing and search operations sluggish or even stall them altogether.

In most scenarios, a more practical approach is to segregate documents intended for deletion into a separate index. This dedicated index can then be configured with appropriate rollover and deletion policies to ensure efficient and timely removal of old or irrelevant data.
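As a rough sketch of that alternative, the snippet below builds an ISM policy body that rolls an index over periodically and deletes whole indices after a retention period. The index pattern `expiring-docs-*`, the state names, and the `1d`/`30d` ages are assumptions picked for illustration; the body would be sent with `PUT _plugins/_ism/policies/<policy_id>`.

```python
import json


def build_rollover_delete_policy(index_pattern, rollover_age="1d", retention_age="30d"):
    """Build an ISM policy body: roll over while hot, then delete the whole index.

    All argument values used here are illustrative assumptions.
    """
    return {
        "policy": {
            "description": "Roll over regularly, drop whole indices after retention",
            "default_state": "hot",
            "states": [
                {
                    "name": "hot",
                    # Start a new backing index once the current one is old enough.
                    "actions": [{"rollover": {"min_index_age": rollover_age}}],
                    "transitions": [
                        {
                            "state_name": "delete",
                            "conditions": {"min_index_age": retention_age},
                        }
                    ],
                },
                {
                    "name": "delete",
                    # Deleting an index removes its files outright, with no
                    # per-document tombstones or merges.
                    "actions": [{"delete": {}}],
                    "transitions": [],
                },
            ],
            # Attach the policy automatically to newly created matching indices.
            "ism_template": {"index_patterns": [index_pattern], "priority": 100},
        }
    }


policy = build_rollover_delete_policy("expiring-docs-*")
print(json.dumps(policy, indent=2))
```

Rollover also needs a write alias on the managed indices, which is left out here to keep the sketch short.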

While deletion by query does offer a convenient mechanism for removing documents, its potential performance drawbacks make it less suitable for production environments. A more controlled and resource-conscious approach, such as using a separate index with rollover policies, is generally recommended for production clusters.

ikibo avatar Nov 15 '23 13:11 ikibo