Improve user experience (UX) when using the delete API
When a user posts a valid delete query, they receive the following OK response:
{
  "create_timestamp": 1000000,
  "opstamp": 3,
  "delete_query": {
    "index_id": "my-index",
    "start_timestamp": 1000000,
    "query": "body:trash"
  }
}
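For reference, here is a minimal sketch of how such a delete query might be posted over the REST API. The delete-tasks endpoint path, the localhost:7280 address, and the index id are assumptions about the setup, not a definitive API reference.

import requests

# Assumption: delete tasks are created via an indexes delete-tasks endpoint
# on a Quickwit node listening on localhost:7280 (adjust to your setup).
QUICKWIT_URL = "http://localhost:7280/api/v1/indexes/my-index/delete-tasks"

delete_query = {
    "query": "body:trash",
    "start_timestamp": 1000000,
}

response = requests.post(QUICKWIT_URL, json=delete_query)
response.raise_for_status()

delete_task = response.json()
# The response echoes the delete query along with an opstamp identifying the
# delete operation, as in the JSON shown above.
print("Delete task registered with opstamp:", delete_task["opstamp"])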
This is fine, but... the user will probably want to check whether the documents to be deleted are actually gone. They will typically run a search query to verify that. Unfortunately, they may still find some documents even when no more deletes are running... This is explained by the fact that some documents can live in immature splits and will only be deleted once the split becomes mature. But the user has no way to check that, nor to check that the delete query has finished. The consequence is that they will believe there is a bug in Quickwit.
We can improve this UX in different ways. For example:
- add new search parameters to filter out splits based on maturity and delete_opstamp, which would let us return accurate results;
- add a query parameter that returns all splits containing matching documents;
- add an endpoint to the delete API that reports which splits still need to undergo a delete operation (see the sketch after this list);
- and possibly more.
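To illustrate the third option, here is a hypothetical polling helper. The pending-splits endpoint and its response shape are purely assumptions sketched for this proposal; nothing like it exists in Quickwit today.

import time
import requests

# Hypothetical endpoint sketched for the proposal above: it would list the
# splits that still contain documents matching the delete query (i.e. immature
# splits whose delete_opstamp is behind the delete task's opstamp).
PENDING_SPLITS_URL = (
    "http://localhost:7280/api/v1/indexes/my-index/delete-tasks/{opstamp}/pending-splits"
)

def wait_for_delete_completion(opstamp: int, poll_interval_s: float = 30.0) -> None:
    """Polls until no split is left to process for the given delete opstamp."""
    while True:
        response = requests.get(PENDING_SPLITS_URL.format(opstamp=opstamp))
        response.raise_for_status()
        pending_splits = response.json()  # assumed: a list of split ids
        if not pending_splits:
            print(f"Delete task {opstamp} fully applied.")
            return
        print(f"{len(pending_splits)} split(s) still pending for opstamp {opstamp}.")
        time.sleep(poll_interval_s)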
Hi Francois,
I am currently considering using Quickwit with my team for some workloads, and one tricky requirement we need to handle is GDPR deletions. We usually handle thousands of requests per day, possibly growing to millions in the coming years. Quickwit is limited to a few dozen per second, which makes adoption harder.
In our case, these operations don't need to happen in real time: we could enqueue the delete operations and have them executed all at once at a later point.
Would it be feasible to add something like this, so that thousands of delete requests a day could be enqueued and executed as a single processing operation?
Our goal is to use this on datasets reaching hundreds of TBs. Ideally, we would also want statistics tracking how many documents were deleted for each operation; otherwise, we would need to run a query first.
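To make the idea concrete, here is a rough sketch of the client-side batching we have in mind: GDPR deletion requests are accumulated during the day and flushed as a single delete query. The user_id field name and the delete-tasks endpoint are assumptions for illustration only.

import requests

# Rough sketch: accumulate GDPR deletion requests and submit them as one
# combined delete query, so Quickwit sees a single delete task instead of
# thousands of individual ones. Field name and endpoint are assumptions.
DELETE_TASKS_URL = "http://localhost:7280/api/v1/indexes/my-index/delete-tasks"

class GdprDeleteBatcher:
    def __init__(self) -> None:
        self.pending_user_ids: set[str] = set()

    def enqueue(self, user_id: str) -> None:
        self.pending_user_ids.add(user_id)

    def flush(self) -> dict:
        """Submits all enqueued deletions as one delete query and clears the queue."""
        if not self.pending_user_ids:
            return {}
        # e.g. "user_id:alice OR user_id:bob OR user_id:carol"
        query = " OR ".join(f"user_id:{uid}" for uid in sorted(self.pending_user_ids))
        response = requests.post(DELETE_TASKS_URL, json={"query": query})
        response.raise_for_status()
        self.pending_user_ids.clear()
        return response.json()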