:bug: Revocation Registry Flow Fails in High Availability Setup Without Shared Storage
In a multi-tenant ACA-Py deployment running on Kubernetes without shared storage between pods, revocation registry/tails file setup fails intermittently: the tails file exists only on the pod that generated it, but a different pod may end up handling the upload to the tails server.
Current Environment
- Kubernetes deployment with 3 pods:
  - 2 agent pods (multitenant-agent-785d65f55-5z24k, multitenant-agent-785d65f55-9rgwr)
  - 1 endorser pod (governance-agent-7fdd775fd-48wsd)
- No shared storage between agent pods
Current Flow
When creating a credential definition with revocation enabled, the following sequence occurs:
- Request is received by Pod A (e.g., multitenant-agent-785d65f55-5z24k)
- Pod A initializes the revocation registry/tails file
- Pod A generates the revocation registry/tails file
- Pod A publishes to ledger, waits for endorsement and acknowledgment
- Pod B (e.g., multitenant-agent-785d65f55-9rgwr) receives the endorsed and acknowledged transaction via DIDComm
- Pod B attempts to upload revocation registry/tails file to tails server (fails because revocation registry/tails file exists only on Pod A)
- Process fails to complete
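The failure in the sequence above can be reproduced in miniature. The following is only a toy simulation, not ACA-Py code: `Pod`, `current_flow`, and the dict standing in for the tails server are hypothetical names, and each pod's private disk is modelled with its own temporary directory.

```python
import tempfile
from pathlib import Path


class Pod:
    """Toy stand-in for an agent pod with its own private local disk."""

    def __init__(self, name: str) -> None:
        self.name = name
        # Each pod gets a separate directory: nothing is shared between pods.
        self._tmp = tempfile.TemporaryDirectory()
        self.tails_dir = Path(self._tmp.name)

    def generate_tails_file(self, rev_reg_id: str) -> Path:
        """Write a placeholder tails file to this pod's local disk."""
        path = self.tails_dir / rev_reg_id
        path.write_bytes(b"tails-data")
        return path

    def upload_tails_file(self, rev_reg_id: str, tails_server: dict) -> None:
        """Upload from local disk; fails if this pod never had the file."""
        path = self.tails_dir / rev_reg_id
        if not path.exists():
            raise FileNotFoundError(f"{self.name} has no tails file for {rev_reg_id}")
        tails_server[rev_reg_id] = path.read_bytes()


def current_flow(generating_pod: Pod, event_pod: Pod, rev_reg_id: str, tails_server: dict) -> None:
    """Generate on one pod, then upload on whichever pod gets the endorsed event."""
    generating_pod.generate_tails_file(rev_reg_id)
    # ... ledger publication, endorsement and acknowledgment would happen here ...
    event_pod.upload_tails_file(rev_reg_id, tails_server)
```

Calling `current_flow(pod_a, pod_a, ...)` succeeds, while `current_flow(pod_a, pod_b, ...)` raises `FileNotFoundError`, which matches the inconsistent behaviour: the outcome depends on which pod happens to receive the endorsed transaction.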
Reproduction Steps
- Deploy ACA-Py in a multi-pod Kubernetes environment without shared storage
- Create a credential definition with revocation enabled
- Observe the transaction logs to see that different pods handle different parts of the revocation setup process
- Observe failure when the pod that receives the endorsed transaction doesn't have access to the tails file
Proposed Solution
Reorder the revocation registry/tails file setup flow to ensure the pod with the revocation registry/tails file handles the upload before waiting for ledger transactions to complete:
1. Initialize the revocation registry/tails file
2. Generate the revocation registry/tails file
3. Upload revocation registry/tails file to tails server (moved before ledger publication)
4. Publish to ledger, wait for transaction to be endorsed and acknowledged
5. Receive endorsed and acknowledged transaction
6. Mark revocation registry as active
This ensures that the upload to the tails server is handled by the same pod that generated the revocation registry/tails file, avoiding the need for shared storage between pods.
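As a rough illustration of the reordered flow, here is a minimal sketch under simplifying assumptions. None of these names come from ACA-Py: `Pod`, `start_registry_setup`, and `on_endorsed` are hypothetical, and the tails server, ledger outbox, and registry state are plain Python containers.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    """Toy pod: local_files models a private disk that other pods cannot see."""

    name: str
    local_files: dict = field(default_factory=dict)


def start_registry_setup(pod: Pod, rev_reg_id: str, tails_server: dict, ledger_outbox: list) -> None:
    """First four steps, all on the pod that received the request."""
    # Initialize and generate the tails file on this pod's local disk.
    pod.local_files[rev_reg_id] = b"tails-data"
    # Upload while the file is still guaranteed to be local.
    tails_server[rev_reg_id] = pod.local_files[rev_reg_id]
    # Only now hand the transaction to the ledger / endorser.
    ledger_outbox.append(rev_reg_id)


def on_endorsed(pod: Pod, rev_reg_id: str, tails_server: dict, registry_state: dict) -> None:
    """Last two steps, on whichever pod receives the endorsed transaction."""
    # No local file access is needed here, so any pod can run this.
    if rev_reg_id not in tails_server:
        raise RuntimeError("tails file was not uploaded before publication")
    registry_state[rev_reg_id] = "active"
```

In this sketch Pod A can run `start_registry_setup` and Pod B can run `on_endorsed` without Pod B ever touching the tails file, which is the property the proposal is after.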
Questions for Discussion
- Are there any potential race conditions in this revised approach?
- How should error handling work if the upload succeeds but subsequent ledger transactions fail?
- Would this change impact other aspects of the credential issuance or revocation process?
- Are there alternative solutions worth considering?
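On the error-handling question, one option (a sketch, not a description of current ACA-Py behaviour) is to record progress on the registry record and resume from there, so a failed ledger write can be retried without repeating the upload. `RegistryRecord`, its `state` values, `LedgerError`, and `finish_setup` are all hypothetical, and the sketch assumes the record lives in storage every pod can read.

```python
from dataclasses import dataclass


@dataclass
class RegistryRecord:
    """Hypothetical record; assumed to live in storage that every pod can read."""

    rev_reg_id: str
    state: str = "generated"  # generated -> uploaded -> active


class LedgerError(Exception):
    """Raised by the stand-in ledger when publication fails."""


def finish_setup(record: RegistryRecord, upload, publish) -> str:
    """Resume setup from the recorded state so a retry never repeats the upload."""
    if record.state == "generated":
        upload(record.rev_reg_id)
        record.state = "uploaded"
    if record.state == "uploaded":
        # If this raises LedgerError, the state stays "uploaded" for the retry.
        publish(record.rev_reg_id)
        record.state = "active"
    return record.state
```

With this shape, a ledger failure leaves an uploaded but unreferenced tails file and a record in the `uploaded` state. A later retry skips the upload, so it no longer needs to run on the pod that generated the file.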
I’m not the best person to answer most of the discussion questions, but I’m wondering about steps 4 and 5. Those steps are AFAIK Indy specific. Should this process be working at the layer above, where there is just the step “Publish to ledger, waiting on the acknowledgment”, and the endorsing and other transaction handling is abstracted to whatever ledger handler is being used? Perhaps this is already there, and perhaps this is not the right Issue to handle that wider issue.
Definitely agree that having the same pod handling the entire process is important.
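To make the layering idea concrete, here is a small sketch of what a ledger-agnostic "publish and wait for acknowledgment" step could look like. This is an assumption about a possible design, not an existing ACA-Py interface: `LedgerPublisher`, `EndorsedPublisher`, `DirectPublisher`, and `setup_registry` are hypothetical names.

```python
import asyncio
from typing import Protocol


class LedgerPublisher(Protocol):
    """Hypothetical interface: one call that returns once the write is acknowledged."""

    async def publish_and_wait(self, txn: dict) -> dict: ...


class EndorsedPublisher:
    """Indy-style stand-in: endorsement is an internal detail of publishing."""

    def __init__(self) -> None:
        self.log = []

    async def publish_and_wait(self, txn: dict) -> dict:
        self.log.append("request endorsement")
        await asyncio.sleep(0)  # placeholder for the endorser round trip
        self.log.append("write to ledger")
        return {**txn, "acknowledged": True}


class DirectPublisher:
    """Stand-in for a ledger that needs no endorser."""

    async def publish_and_wait(self, txn: dict) -> dict:
        return {**txn, "acknowledged": True}


async def setup_registry(publisher: LedgerPublisher, rev_reg_id: str) -> str:
    """The caller sees a single publish step, whatever the ledger needs underneath."""
    result = await publisher.publish_and_wait({"rev_reg_id": rev_reg_id})
    return "active" if result["acknowledged"] else "failed"
```

Note that this sketch awaits the acknowledgment inside one process. In the multi-pod setup described in this issue the endorsed transaction may arrive on a different pod, so a real implementation would need some other way to complete the wait.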
@swcurran, thank you for the response.
Yeah, I'm not overly familiar with the code and I'm just interpreting what I'm reading in debug logs when running tests trying to figure this out.
A colleague and I have been digging in and we noticed this function and the note about multiple pods:
    async def on_revocation_registry_endorsed_event(profile: Profile, event: Event):
        """Handle revocation registry endorsement event."""
        meta_data = event.payload
        rev_reg_id = meta_data["context"]["rev_reg_id"]
        revoc = IndyRevocation(profile)
        registry_record = await revoc.get_issuer_rev_reg_record(rev_reg_id)

        if profile.settings.get_value("endorser.auto_request"):
            # NOTE: if there are multiple pods, then the one processing this
            # event may not be the one that generated the tails file.
            await registry_record.upload_tails_file(profile)

            # Post the initial revocation entry
            await notify_revocation_entry_event(profile, registry_record.record_id, meta_data)

        # create a "pending" registry if one is requested
        # (this is done automatically when creating a credential definition, so that when a
        # revocation registry fills up, we can continue to issue credentials without a
        # delay)
        create_pending_rev_reg = meta_data["processing"].get("create_pending_rev_reg", False)
        if create_pending_rev_reg:
            endorser_connection_id = (
                meta_data["endorser"].get("connection_id", None)
                if "endorser" in meta_data
                else None
            )
            await revoc.init_issuer_registry(
                registry_record.cred_def_id,
                registry_record.max_cred_num,
                registry_record.revoc_def_type,
                endorser_connection_id=endorser_connection_id,
            )
Maybe there's a way to run the await registry_record.upload_tails_file(profile) before doing anything in the ledger?