anonlink-entity-service icon indicating copy to clipboard operation
anonlink-entity-service copied to clipboard

Delete run leading to server error

Open hardbyte opened this issue 4 years ago • 1 comments

This race condition occured while running the integration tests here.

At first glance it seems like the celery worker correctly threw a DB exception in the compute_filter_similarity task, leading to the callback throwing an exception... so far this is fine. I note that the celery worker then tries to write to the runs table again to mark the run as failed:

UPDATE runs SET
              state = 'error',
              time_completed = now()
WHERE ...

Over in the flask app we queue a task to delete any similarity files then start a db transaction to delete the run. It is conceivable that the delete files task finishes before a file is written.

Lastly in the flask app, we don't catch the DB integrity error and end up giving the user a 500 response :-/

Perhaps this could be addressed more like project deletion - by setting the run as deleted in the database instead of deleting the row.

Note I didn't get to see the high memory worker logs - see #418

Logs

Integration test failure:

 ============================= test session starts ==============================
platform linux -- Python 3.7.3, pytest-3.5.1, py-1.8.0, pluggy-0.6.0
rootdir: /var/www, inifile:
plugins: celery-4.3.0
collected 371 items
entityservice/tests/test_admin.py ..                                     [  0%]
entityservice/tests/test_deletes.py ..............F
=================================== FAILURES ===================================
______ test_delete_runs_after_completed_run[result_type_number_parties2] _______
requests = <requests.sessions.Session object at 0x7ff4d5ebdfd0>
result_type_number_parties = ('permutations', 2)
    def test_delete_runs_after_completed_run(requests, result_type_number_parties):
        project, run_id = _create_data_linkage_run(requests, result_type_number_parties)
        result_token = project['result_token']
        wait_for_run_completion(requests, project, run_id, result_token, timeout=20)
>       delete_run(requests, project['project_id'], run_id, result_token)
entityservice/tests/test_deletes.py:23: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
entityservice/tests/util.py:219: in delete_run
    _check_delete_response(r)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
r = <Response [500]>
    def _check_delete_response(r):
        # Note we allow for a 403 because the project may already have been deleted
>       assert r.status_code in {204, 403}, 'I received this instead: {}'.format(r.text)
E       AssertionError: I received this instead: {
E         "detail": "The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.",
E         "status": 500,
E         "title": "Internal Server Error",
E         "type": "about:blank"
E       }
entityservice/tests/util.py:201: AssertionError
--------------------- generated xml file: /mnt/results.xml ---------------------
===================== 1 failed, 16 passed in 46.97 seconds ===================== 

Flask:

 2019-09-05T06:29:37.944268Z [debug    ] Run status authorized using result_token token [entityservice.views.run.status] pid=796049f3ca46af151d53da399b119cd8370118a5c1139a9f rid=17a0b4d366e08d7eb199b87e51e40262caa388ba1342cc6d
2019-09-05T06:29:37.946954Z [debug    ] Connecting to redis            [entityservice.cache] 
2019-09-05T06:29:37.948791Z [debug    ] Looking up read only redis slave using sentinel protocol [entityservice.cache] 
2019-09-05T06:29:38.686373Z [debug    ] request run status             [entityservice.views.run.status] pid=796049f3ca46af151d53da399b119cd8370118a5c1139a9f rid=17a0b4d366e08d7eb199b87e51e40262caa388ba1342cc6d
2019-09-05T06:29:38.708630Z [debug    ] checking if provided authorization is a results_token [entityservice.views.auth_checks] 
2019-09-05T06:29:38.727345Z [debug    ] Run status authorized using result_token token [entityservice.views.run.status] pid=796049f3ca46af151d53da399b119cd8370118a5c1139a9f rid=17a0b4d366e08d7eb199b87e51e40262caa388ba1342cc6d
2019-09-05T06:29:38.954044Z [debug    ] request to delete run          [entityservice.views.run.description] pid=796049f3ca46af151d53da399b119cd8370118a5c1139a9f rid=17a0b4d366e08d7eb199b87e51e40262caa388ba1342cc6d
2019-09-05T06:29:38.956113Z [debug    ] checking authorization of 'result_token' [entityservice.views.auth_checks] 
2019-09-05T06:29:38.957657Z [debug    ] approved request to delete run [entityservice.views.run.description] pid=796049f3ca46af151d53da399b119cd8370118a5c1139a9f rid=17a0b4d366e08d7eb199b87e51e40262caa388ba1342cc6d
2019-09-05T06:29:38.957912Z [debug    ] Retrieving run details from database [entityservice.views.run.description] pid=796049f3ca46af151d53da399b119cd8370118a5c1139a9f rid=17a0b4d366e08d7eb199b87e51e40262caa388ba1342cc6d
2019-09-05T06:29:38.958936Z [debug    ] Queuing task to remove similarities file from object store [entityservice.views.run.description] pid=796049f3ca46af151d53da399b119cd8370118a5c1139a9f rid=17a0b4d366e08d7eb199b87e51e40262caa388ba1342cc6d
2019-09-05T06:29:39.040080Z [debug    ] start db transaction to delete run [entityservice.database.deletion] rid=17a0b4d366e08d7eb199b87e51e40262caa388ba1342cc6d
Exception on /projects/796049f3ca46af151d53da399b119cd8370118a5c1139a9f/runs/17a0b4d366e08d7eb199b87e51e40262caa388ba1342cc6d [DELETE]
Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/lib/python3.7/site-packages/flask/app.py", line 1815, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/lib/python3.7/site-packages/flask/app.py", line 1718, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/lib/python3.7/site-packages/flask/_compat.py", line 35, in reraise
    raise value
  File "/usr/lib/python3.7/site-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/lib/python3.7/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/usr/lib/python3.7/site-packages/connexion/decorators/decorator.py", line 73, in wrapper
    response = function(request)
  File "/usr/lib/python3.7/site-packages/connexion/decorators/validation.py", line 293, in wrapper
    return function(request)
  File "/usr/lib/python3.7/site-packages/connexion/decorators/response.py", line 104, in wrapper
    response = function(request)
  File "/usr/lib/python3.7/site-packages/connexion/decorators/decorator.py", line 44, in wrapper
    response = function(request)
  File "/usr/lib/python3.7/site-packages/connexion/decorators/parameter.py", line 219, in wrapper
    return function(**kwargs)
  File "/var/www/entityservice/views/run/description.py", line 38, in delete
    delete_run_data(conn, run_id)
  File "/var/www/entityservice/database/deletion.py", line 23, in delete_run_data
    """, [run_id])
psycopg2.IntegrityError: update or delete on table "runs" violates foreign key constraint "permutation_masks_run_fkey" on table "permutation_masks"
DETAIL:  Key (run_id)=(17a0b4d366e08d7eb199b87e51e40262caa388ba1342cc6d) is still referenced from table "permutation_masks". 

Worker:

 [2019-09-05 06:30:37,655: INFO/ForkPoolWorker-4] 2019-09-05T06:30:37.655109Z [info     ] Counting contributing parties  [entityservice.utils] 
[2019-09-05 06:30:37,721: INFO/ForkPoolWorker-4] 2019-09-05T06:30:37.721443Z [info     ] Parties where data is ready: 2 [entityservice.utils] 
[2019-09-05 06:30:37,722: INFO/ForkPoolWorker-4] 2019-09-05T06:30:37.722682Z [info     ] 2/2 parties have contributed clks [entityservice.utils] 
[2019-09-05 06:30:37,729: INFO/ForkPoolWorker-4] Task entityservice.tasks.pre_run_check.check_for_executable_runs[84af2b70-1dc5-45aa-9acf-99557b71004d] succeeded in 0.07580971880815923s: None
[2019-09-05 06:30:37,937: INFO/MainProcess] Received task: entityservice.tasks.run.prerun_check[6889779e-8007-45b9-8fbf-68d1ca3571a4]  
[2019-09-05 06:30:38,029: INFO/ForkPoolWorker-6] Task entityservice.tasks.run.prerun_check[6889779e-8007-45b9-8fbf-68d1ca3571a4] succeeded in 0.09161759214475751s: None
[2019-09-05 06:30:38,046: INFO/MainProcess] Received task: entityservice.tasks.comparing.compute_filter_similarity[5937a35e-1470-4b36-ad39-1c463c2240b1]  
[2019-09-05 06:30:38,048: INFO/MainProcess] Received task: entityservice.tasks.base_task.celery_bug_fix[b580b54e-0d08-4b23-a646-70aae4900711]  
[2019-09-05 06:30:38,051: INFO/ForkPoolWorker-2] Task entityservice.tasks.base_task.celery_bug_fix[b580b54e-0d08-4b23-a646-70aae4900711] succeeded in 0.0018211179412901402s: [0, None, None]
[2019-09-05 06:30:38,056: ERROR/ForkPoolWorker-8] Chord callback for 'a4e9c9a2-1d50-4a58-97b1-dfe0a001c179' raised: ChordError("Dependency 5937a35e-1470-4b36-ad39-1c463c2240b1 raised DBResourceMissing('unknown unknown project or run not found in database not found in database not found in database')")
Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/celery/app/trace.py", line 385, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/var/www/entityservice/tasks/base_task.py", line 87, in __call__
    return super(TracedTask, self).__call__(*args, **kwargs)
  File "/usr/lib/python3.7/site-packages/celery/app/trace.py", line 648, in __protected_call__
    return self.run(*args, **kwargs)
  File "/var/www/entityservice/tasks/comparing.py", line 123, in compute_filter_similarity
    raise DBResourceMissing("project or run not found in database")
entityservice.errors.DBResourceMissing: unknown project or run not found in database not found in database
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/celery/backends/redis.py", line 399, in on_chord_part_return
    callback.delay([unpack(tup, decode) for tup in resl])
  File "/usr/lib/python3.7/site-packages/celery/backends/redis.py", line 399, in <listcomp>
    callback.delay([unpack(tup, decode) for tup in resl])
  File "/usr/lib/python3.7/site-packages/celery/backends/redis.py", line 352, in _unpack_chord_result
    raise ChordError('Dependency {0} raised {1!r}'.format(tid, retval))
celery.exceptions.ChordError: Dependency 5937a35e-1470-4b36-ad39-1c463c2240b1 raised DBResourceMissing('unknown unknown project or run not found in database not found in database not found in database')
[2019-09-05 06:30:38,060: INFO/MainProcess] Received task: entityservice.tasks.base_task.on_chord_error[d21e2ea8-e1c0-4148-b058-75275446eac8]  
[2019-09-05 06:30:38,060: INFO/ForkPoolWorker-8] Task entityservice.tasks.comparing.compute_filter_similarity[5937a35e-1470-4b36-ad39-1c463c2240b1] raised expected: DBResourceMissing('unknown project or run not found in database not found in database')
[2019-09-05 06:30:38,063: WARNING/ForkPoolWorker-5] 2019-09-05T06:30:38.063739Z [warning  ] Marked run as failure          [entityservice.tasks]
[2019-09-05 06:30:38,064: INFO/ForkPoolWorker-5] Task entityservice.tasks.base_task.on_chord_error[d21e2ea8-e1c0-4148-b058-75275446eac8] succeeded in 0.0030970340594649315s: None 

hardbyte avatar Sep 05 '19 06:09 hardbyte

Given the pr for supporting run deletion introduced a cascading delete on the permutation mask table, which is on the flask error, it could be the database for the integration tests of the develop branch wasn't cleanly initiated eg due to a persistent volume.

hardbyte avatar Sep 05 '19 10:09 hardbyte