h icon indicating copy to clipboard operation
h copied to clipboard

Use the job queue to delete annotations from Elasticsearch

Open seanh opened this issue 1 year ago • 0 comments

Fixes https://github.com/hypothesis/h/issues/7840.

Problem: h's annotation delete API doesn't delete annotations from Elasticsearch with complete reliability. Sometimes an annotation is deleted from the DB but not from Elasticsearch. This means we haven't deleted the data that the user asked us to delete and it also causes some nuisance issues (e.g. https://github.com/hypothesis/client/issues/5219, https://github.com/hypothesis/h/issues/7796). This also prevents us from implementing PRD: Self-Service User Deletion: the self-service user deletion feature requires a reliable means of deleting a user's annotations.

Solution: we already use h's job queue to ensure that annotation creates and updates are always synced to Elasticsearch. See this presentation for more details of the solution. Support for syncing annotation deletions was never added to the job queue. This PR adds annotation deletion support to the job queue so that all future deletions will reliably be synced to Elasticsearch.

This PR fixes the annotation delete API only. There'll be a follow-up PR to build on the same solution to make bulk deletion of annotations reliable when deleting user accounts.

Context: what happens when you delete an annotation

This isn't reliable. The app could crash (or be killed by a deployment or autoscaling etc) after committing the DB transaction but before making the request to Elasticsearch. Or in the case of Elasticsearch issues the three requests to Elasticsearch could all fail.

This is fairly easy to fix: everything can remain the same, but the DB transaction also needs to include adding a "sync_annotation" job to the transactional job queue in the DB. Support for deleting annotations then needs to be added to the SearchIndexService.sync() method that handles "sync_annotation" jobs. This will make syncing annotation deletions to Elasticsearch reliable because:

  1. The "sync_annotation" job is added to the job queue in the same atomic DB transaction as the annotation is marked as deleted. So either both the annotation is marked as deleted in the DB and the job is added to the queue, or neither.
  2. Once the "sync_annotation" job is on the queue the sync_annotations() task (which runs periodically and processes these jobs) will not delete the job from the queue until it has successfully deleted the annotation from Elasticsearch and verified that the annotation is deleted from Elasticsearch. If deleting the annotation fails the task will just keep trying and trying. (It actually gives up after 30 days, at which point the job is still on the queue but is "expired", and an alarm will go off notifying us of the presence of an expired job.)

Testing

Deleting an annotation

This tests the happy path, what normally happens when you delete an annotation: the delete annotation API makes a foreground call to "delete" the annotation from Elasticsearch (in fact annotations are left in Elasticsearch with the body replaced with {"deleted": true}) and also adds a "sync_annotation" job to the queue. The "sync_annotation" job later runs, checks that the annotation was deleted from Elasticsearch, and deletes the job from the queue.

  1. Visit http://localhost:5000/docs/help and create an annotation

  2. You should be able to see the annotation in Elasticsearch: http://localhost:9200/hypothesis/_search

  3. A "sync_annotation" job for the annotation will have been added to the queue. Run h-periodic and you should see the sync_annotations() task confirming that the annotation was successfully synced from Postgres to Elasticsearch.

    Due to the job's scheduled_at attribute it isn't available for processing immediately, so at first you may see this output from the sync_annotations() task indicating that it found no jobs to do:

    h.tasks.indexer.sync_annotations[*]: {}
    

    Wait for the task to run again and you should see this output indicating that it checked one annotation and found it to already be up-to-date in Elasticsearch:

    h.tasks.indexer.sync_annotations[*]: {'Completed/storage.create_annotation/Up_to_date_in_Elastic': 1, 'Completed/storage.create_annotation/Total': 1, 'Completed/Total': 1}
    

    The sync_annotations() task didn't actually index anything into Elasticsearch: it just checked and found that the annotation was already up-to-date in Elasticsearch. The job will now have been deleted fro the job table.

  4. Stop h-periodic again

  5. Delete the annotation

  6. You should see that the annotation has been replaced with {"deleted": true} in Elasticsearch: http://localhost:9200/hypothesis/_search. This was not done by the job queue but by the foreground call to Elasticsearch in SearchIndexService.delete_annotation_by_id()

  7. A "sync_annotation" job will be added to the queue. Run h-periodic again and you should see this output from the sync_annotations() task indicating that it confirmed that the annotation had been deleted in Elasticsearch:

    h.tasks.indexer.sync_annotations[*]: {'Completed/AnnotationDeleteService.delete_annotation/Deleted_from_db': 1, 'Completed/AnnotationDeleteService.delete_annotation/Total': 1, 'Completed/Total': 1}
    

    The job will also have been deleted from the queue.

Allowing the job queue to delete an annotation

This tests what happens if the initial foreground request to Elasticsearch to delete the annotation fails and the job queue has to step in later and delete it.

  1. Disable the foreground deleting of annotations from Elasticsearch:

    diff --git a/h/services/search_index.py b/h/services/search_index.py
    index e87752807..4774de3e8 100644
    --- a/h/services/search_index.py
    +++ b/h/services/search_index.py
    @@ -114,8 +114,6 @@ class SearchIndexService:
                 operations
             """
    
    -        self._index_annotation_body(annotation_id, {"deleted": True}, refresh=refresh)
    -
         def handle_annotation_event(self, event):
             """
             Process an annotation event, taking appropriate action to the event.
    

    It's difficult to simulate an Elasticsearch blip or outage by taking down Elasticsearch in the dev environment: when you bring Elasticsearch back up seems to receive requests that were sent while it was down and processes them. So hacking the code as above will have to suffice.

  2. Go to http://localhost:5000/docs/help and create an annotation

  3. You should be able to see the annotation in Elasticsearch: http://localhost:9200/hypothesis/_search

  4. Run h-periodic and wait for the sync_annotations() task to run and confirm that the annotation was synced to Elasticsearch.

    The job may not be scheduled for execution yet the next time the task runs, so you may need to wait for the task to run again. The task will delete the job from the queue:

    h.tasks.indexer.sync_annotations[*]: {'Completed/storage.create_annotation/Up_to_date_in_Elastic': 1, 'Completed/storage.create_annotation/Total': 1, 'Completed/Total': 1}
    
  5. Delete the annotation. This will mark the annotation as deleted in the DB but won't delete it from Elasticsearch. From the client's point of view the delete annotation API request will receive a successful response, and to the user the annotation will appear to have been deleted. A "sync_annotation" job will be added to the queue.

  6. Run h-periodic and wait for the sync_annotation() task to run.

    The job may not be scheduled for execution yet the next time the task runs, so you may need to wait for the task to run again.

    You should see the task syncing the annotation deletion to Elasticsearch:

    h.tasks.indexer.sync_annotations[*]: {'Synced/AnnotationDeleteService.delete_annotation/Deleted_from_db': 1, 'Synced/AnnotationDeleteService.delete_annotation/Total': 1}
    

    At this point the annotation will have been removed from Elasticsearch (http://localhost:9200/hypothesis/_search) but the job will still be in the DB.

    Wait for the task to run again and it'll confirm that the annotation was deleted from Elasticsearch and remove the job from the DB:

    h.tasks.indexer.sync_annotations[*]: {'Completed/AnnotationDeleteService.delete_annotation/Deleted_from_db': 1, 'Completed/AnnotationDeleteService.delete_annotation/Total': 1}
    

Deleting an annotation that's already been purged

The purge_deleted_annotations() task runs hourly whereas the sync_annotations() task runs every minute, so it'll be a normal occurrence for the purge_deleted_annotations() task to run and expunge a deleted annotation from the DB before the sync_annotations() task picks up that annotation's "sync_annotation" job.

To simulate this, delete an annotation and then run make sql and delete the annotation from the DB, and then run h-periodic and wait for the sync_annotations() task to run.

seanh avatar Feb 16 '24 17:02 seanh