Race condition when setting endorser/author roles on connection record
When setting the connection metadata on a connection record (using the endorse transaction API, particularly the set_endorser_role and set_endorser_info methods), as far as I can tell, the operations seem to queue the setting of metadata as a background task.
Even though the methods return a success response, there is some delay in the connection record actually being updated. This leads to a race condition when trying to add the second set_endorser_role job (e.g. for the author role) too soon after the endorser role has been set (or queued to be set).
Furthermore, trying to read the metadata of the connection record (with get_metadata in the connection API), in order to assert that the endorser role has successfully been set, to make sure that it's now safe to set the author role, doing that without sufficient delay can inadvertently cause the set_endorser_role background task to fail!
This behavior has been the source of much confusion, but fortunately I have managed to create a reproducible example. I'll describe in more detail in the following comment.
Simple description of steps to reproduce:
- Establish a connection between two agents.
- Set the connection metadata (for example, set the endorser role).
- Immediately set the author role from the other agent's perspective (or immediately fetch the connection metadata)
- Notice that the metadata may not yet be updated, and/or raise a duplicate record error.
Possible solution:
- Ensure that metadata writes are completed in the main thread and don't return success/200 until the write operation is fully completed.
- If background processing is necessary, then perhaps just add clearer documentation to indicate that the job is queued. Generally a 202 success code is used to indicate this
Additional context:
- We are using ACA-Py 0.9.0 - although these methods have no changes in later releases (up to 0.10.3)
The following scenario is for onboarding an issuer to our ecosystem. An issuer requires a public DID, and in order for the DID to be written to the ledger, the to-be issuer requires a connection with an endorser, where the connection is configured with endorser/author roles and endorser info, such that the issuer nym can be registered on the ledger.
This is endorsement process is described in the following docs.
Here's a sample method where we set both endorser and author roles on the connection.
Note we are using the aries-cloudcontroller-python project which provides us an AcaPyClient instance with all the ACA-Py API methods.
async def set_endorser_roles(
endorser_controller: AcaPyClient,
issuer_controller: AcaPyClient,
endorser_connection_id: str,
issuer_connection_id: str,
logger: Logger,
):
logger.debug("Setting roles for endorser")
await set_endorser_role(endorser_controller, endorser_connection_id, logger)
logger.debug("Setting roles for author")
await set_author_role(issuer_controller, issuer_connection_id, logger)
logger.debug("Successfully set roles for connection.")
## ^ This method calls the following:
async def set_endorser_role(
endorser_controller: AcaPyClient, endorser_connection_id: str, logger: Logger
):
try:
logger.debug("Setting roles for endorser on endorser-issuer connection.")
await endorser_controller.endorse_transaction.set_endorser_role(
conn_id=endorser_connection_id,
transaction_my_job="TRANSACTION_ENDORSER",
)
logger.debug("Successfully set endorser role.")
await asyncio.sleep(DEFAULT_DELAY) # Allow ACA-Py records to update <---- NB
except ClientResponseError as e:
logger.error("Failed to set endorser role: {}.", e)
async def set_author_role(
issuer_controller: AcaPyClient, issuer_connection_id: str, logger: Logger
):
try:
logger.debug("Setting roles for author on issuer-endorser connection")
await issuer_controller.endorse_transaction.set_endorser_role(
conn_id=issuer_connection_id,
transaction_my_job="TRANSACTION_AUTHOR",
)
logger.debug("Successfully set author role.")
await asyncio.sleep(DEFAULT_DELAY) # Allow ACA-Py records to update <---- NB
except ClientResponseError as e:
logger.error("Failed to set author role: {}.", e)
FYI, the await endorser_controller.endorse_transaction.set_endorser_role method, that is simply making the API call to the set_endorser_role method, on behalf of the endorser.
So, the above code is essentially calling set_endorser_role twice - once from the endorser_controller to set transaction_my_job="TRANSACTION_ENDORSER", and then we call from the issuer_controller to set transaction_my_job="TRANSACTION_AUTHOR".
You'll notice there is a await asyncio.sleep(DEFAULT_DELAY) sleep statement after set_endorser_role is called. This is the necessary workaround to cater for the background task to complete. When working locally, as little as a 0.02 second sleep is sometimes sufficient. When working on k8s, the sleep must be at least 0.2 -- 1 second long for reliable results.
If I set DEFAULT_DELAY = 0, then we can reproduce our error:
2023-10-09 09:22:46.169 | DEBUG | Setting roles for endorser
2023-10-09 09:22:46.169 | DEBUG | Setting roles for endorser on endorser-issuer connection.
2023-10-09 09:22:46.193 | DEBUG | Successfully set endorser role.
2023-10-09 09:22:46.194 | DEBUG | Setting roles for author
2023-10-09 09:22:46.195 | DEBUG | Setting roles for author on issuer-endorser connection
2023-10-09 09:22:46,220 aries_cloudagent.core.dispatcher ERROR Handler error: set_endorser_role
Traceback (most recent call last):
File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step
result = coro.send(None)
File "/home/aries/.local/lib/python3.9/site-packages/aries_cloudagent/protocols/endorse_transaction/v1_0/routes.py", line 601, in set_endorser_role
jobs = await record.metadata_get(session, "transaction_jobs")
=================
File "/home/aries/.local/lib/python3.9/site-packages/aries_cloudagent/connections/models/conn_record.py", line 550, in metadata_get
record = await storage.find_record(
File "/home/aries/.local/lib/python3.9/site-packages/aries_cloudagent/storage/askar.py", line 167, in find_record
raise StorageDuplicateError("Duplicate records found")
aries_cloudagent.storage.error.StorageDuplicateError: Duplicate records found
2023-10-09 09:22:46,223 aries_cloudagent.admin.server ERROR Handler error with exception: Duplicate records found
Traceback (most recent call last):
File "/home/aries/.local/lib/python3.9/site-packages/aries_cloudagent/admin/server.py", line 173, in ready_middleware
return await handler(request)
File "/home/aries/.local/lib/python3.9/site-packages/aries_cloudagent/admin/server.py", line 210, in debug_middleware
return await handler(request)
File "/home/aries/.local/lib/python3.9/site-packages/aiohttp_apispec/middlewares.py", line 45, in validation_middleware
return await handler(request)
File "/home/aries/.local/lib/python3.9/site-packages/aries_cloudagent/admin/server.py", line 330, in check_token
return await handler(request)
File "/home/aries/.local/lib/python3.9/site-packages/aries_cloudagent/admin/server.py", line 379, in check_multitenant_authorization
return await handler(request)
File "/home/aries/.local/lib/python3.9/site-packages/aries_cloudagent/admin/server.py", line 442, in setup_context
return await task
File "/usr/local/lib/python3.9/asyncio/futures.py", line 284, in __await__
yield self # This tells Task to wait for completion.
File "/usr/local/lib/python3.9/asyncio/tasks.py", line 328, in __wakeup
future.result()
File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result
raise self._exception
File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step
result = coro.send(None)
File "/home/aries/.local/lib/python3.9/site-packages/aries_cloudagent/protocols/endorse_transaction/v1_0/routes.py", line 601, in set_endorser_role
jobs = await record.metadata_get(session, "transaction_jobs")
File "/home/aries/.local/lib/python3.9/site-packages/aries_cloudagent/connections/models/conn_record.py", line 550, in metadata_get
record = await storage.find_record(
File "/home/aries/.local/lib/python3.9/site-packages/aries_cloudagent/storage/askar.py", line 167, in find_record
raise StorageDuplicateError("Duplicate records found")
aries_cloudagent.storage.error.StorageDuplicateError: Duplicate records found
Now, we obviously don't want to work with arbitrary sleep durations - there must be a way to assert that the metadata has been set before continuing. So, I found the get_metadata method in the connections API. Alas, we found that trying to read the metadata too soon can also cause the background task to break ... This is a timing issue that cannot always be reproduced locally, but for our deployment on kubernetes, trying to get_metadata without any delay also causes the Duplicate records found error.
To test this out, simply call the get_metadata route immediately after setting endorser role, for the same connection id.
This should be looked at in the AnonCreds RS work that is underway — as we refactor the Endorser handling to be “hidden” from the control — config and process.