aries-cloudagent-python icon indicating copy to clipboard operation
aries-cloudagent-python copied to clipboard

:bug: Concurrent issuance of revocable AnonCreds credential can lead to duplicate cred_rev_id and unrevocable credentials

Open ff137 opened this issue 7 months ago • 14 comments

When an issuer attempts to concurrently issue many AnonCreds to a holder, using a revocable credential definition, duplicate cred_rev_id's can be created, leading to revocation of those credentials becoming impossible, as the revoke request will always fail later with a duplicate records error.

The bug is tested for here: https://github.com/didx-xyz/acapy-cloud/pull/1581

In the test we have an issuer, holder, and a revocable cred def. We issue 10 revocable credentials concurrently, and fetch cred_rev records before attempting to revoke. Out of the box, the test fails with (our own log, not acapy's): Duplicate cred_rev_id found: 5 for credential v2-dd01350d-b9b4-498a-b400-6958cf18b549 (Note: it's not 5 duplicates; "5" is the cred_rev_id that is duplicated)

Adding a module-level lock to acapy_agent/anoncreds/revocation.py for the _create_credential method solved the problem for us, and allows above test to pass. (Commit showing module lock: https://github.com/openwallet-foundation/acapy/commit/7c1072c21ae0e78173e2e96c4338e730de3efcd3)

Help is wanted for a cleaner solution, please.

ff137 avatar May 26 '25 10:05 ff137

Any more on this? It is a concerning one. @andrewwhitehead -- any thoughts to add?

~~What I'm not sure of is whether multiple threads are generating RevRegs and getting the same ID (I would have thought a RevReg generation would include generating and returning an ID?) or is it that after generation, multiple threads get the ID and record it in the database?~~

swcurran avatar May 30 '25 18:05 swcurran

Would like to discuss this at the ACA-Py maintainers meeting this coming Tuesday (May 2 -- 9:00 Pacific / 18:00 Central Europe, Meeting Page) -- so would appreciate Maintainers take a look at this before then.

swcurran avatar May 30 '25 18:05 swcurran

I've been thinking about this problem and the lock does make sense for single instance deployments. It could probably be a bit more efficient. However, the problem is with multi-instance deployments there's no way to atomically lock the cred_rev_id index operation.

I don't think askar storage will allow us to do atomic reads. This isn't what it's designed for. However, we do have the conenction_url for the native database, postgres or sqlite.

I think we could make a simple atomic locking service where we provide the rev_reg_def_id to write a value and have atomic reads when the increment process is taking place and then free the value again. Threads on multiple instances would have to wait until it is free and then one would pick it up.

We could do something for sqlite as well but it wouldn't be applicable to clustered deployments.

I haven't been able to think of any other way to solve this other than adding a redis or similar service which I think we want to avoid. Possibly we could even have an additional service to add a type of pub/sub service using postgres a database hooks.

jamshale avatar Jun 05 '25 20:06 jamshale

Plan to have a session and discussion on this and other concurrency issues in ACA-Py at this week's ACA-Pug Meeting. Join Us.

swcurran avatar Jun 06 '25 17:06 swcurran

If standard Redis is too much for solving this, perhaps this Redis fork (Valkey) is appropriate, just for use as a distributed lock manager

https://github.com/valkey-io/valkey-py

Edit: I thought Valkey had some more differences, but it seems to just be Redis without the license changes. I missed all that

ff137 avatar Jun 09 '25 07:06 ff137

We've managed to replace the previous module-level lock (single-instance solution - linked in the OP), with a Valkey lock (multi-instance solution)

  • The "AsyncRedisLock" logic - https://github.com/didx-xyz/acapy/pull/17/files
  • Our new helm files for Valkey, and testing the above changes - https://github.com/didx-xyz/acapy-cloud/pull/1610

This is our new "quick fix", so we can still support high availability.

(note: the acapy change uses redis-py, whereas we deploy valkey -- they are interoperable. Gonna switch to valkey-py for consistency)

ff137 avatar Jun 09 '25 09:06 ff137

My recommendation for a longer-term ACA-Py solution:

  • define a generic class, something like "AcapyLock", which can be initialised with some key (e.g. "cred_rev_id_lock")
    • define __aenter__ and __aexit__ to acquire/release lock, so you can do: async with AcapyLock
  • it uses a DB entry as a default locking method
    • even though it's not what Askar is made for, it's better than a file lock, so shared storage isn't required for multi-instances
  • eventually it can be made configurable to use Redis/Valkey, if available -- or other KV stores like NATS.
    • This custom pluggability can probably be added down the line. With the DB being the quick-fix approach.

Note: Sherlock is a python package which enables distributing locking with multiple backends - may be helpful, although it hasn't been updated in a while.

ff137 avatar Jun 09 '25 09:06 ff137

I think this is a good solution and basically what I was thinking. The only thing I wasn't sure about is if we could really assure atomicity for reads by using a askar DB record.

jamshale avatar Jun 09 '25 16:06 jamshale

It should be possible. Redis/Valkey has the benefit of writing values with expiry times, and returning False if the same key is being set again. That makes a Redis lock mechanism pretty straightforward.

So in Askar - try create a record, have a unique constraint to fail if it already exists, and have some TTL logic to cleanup locks on the off chance a lock isn't removed as expected. Then it's basically the same.

I'm not so clued about the Askar functionality that can solve this out-of-the-box, but there must be a way

ff137 avatar Jun 09 '25 16:06 ff137

A note from a colleague:

PostgreSQL has a locking object thingy Looks like someone made a SQLAlchemy based package for using it for distributed locking in Postgres. Might mean no need to making changes to Askar

I'm not sure how to account for SqlLite or in-memory (if those are supported backends), but thought it's worth sharing

ff137 avatar Jun 09 '25 17:06 ff137

Nice. Would it be worthwhile using Sherlock (likely forking it, since it seems not to be maintained -- and it seem pretty small) and then adding in Postgres (and perhaps Valkey) support? Or is that doing too much to try to be all things to all people?

swcurran avatar Jun 09 '25 17:06 swcurran

I think first is to just get it working with Askar. Which means Postgres - and the other DB backends (is it just Postgres and SQLite?)

What I thought about Sherlock is that it gives some inspiration for defining one class, and making it configurable for diff backends. Don't have to use it, but can borrow from it.

Edit: with the latest AI models, it could be pretty quick to add postgres and maybe sqlite lock support to Sherlock. 👀

ff137 avatar Jun 09 '25 18:06 ff137

It is just postgres and sqlite. I'm not sure what to do about sqlite. Maybe we can just make it a condition that it can only be used for single instance deployments and if/when we have a redis support it could transition to that. I'm actually not sure how askar works for multi-instance sqlite, or if it does, currently.

I agree with basically having our own super lightweight of a locking service like sherlock for use in acapy. Postgres, or askar using postgres, would be the most pressing option to support.

jamshale avatar Jun 09 '25 22:06 jamshale

Related PRs:

  • #3782
  • #3784
  • #3793

ff137 avatar Jun 24 '25 23:06 ff137

This is the reason we manage the cred_rev_id assignment outside of Credo in our deployments. It's basically pushing the problem to the user of the framework, but it's actually quite simple to prevent duplicated identifiers being assigned outside of the framework. We have a postgres database where we increase the current indexCounter by one each time, due to how postgres handles this it will never produce duplicate ids. We then have a constraint that rejects the update if the index counter will exceed the max size of the revocation registry.

We do have a redis lock then to manage the updates to status lists, as well as creation of new revocation registries (to avoid multiple revocation registries being created concurrently for the same cred def for concurrent issuance).

AFAIK Askar does support locking of records (I believe triggered by the for_update when retrieving the record). This locks the record meaning it can solve the problem of concurrent updated to a record (since it seems the index is stored in the record and incremented by 1 each time).

Edit: The locking from Askar uses the native postgres locking, and doesn't seem to work for Askar SQLite: https://www.postgresql.org/docs/current/explicit-locking.html#LOCKING-ROWS

TimoGlastra avatar Jun 25 '25 16:06 TimoGlastra

Closing this issue as the specific bug that was discovered (behaviour that changed unexpectedly) has been addressed, as described above. Opening a new issue to raise the question of whether ACA-Py should have some sort of distributed lock capability.

swcurran avatar Jul 24 '25 15:07 swcurran