aries-cloudagent-python

Fix the rev_list read, increment index and update transaction

Open jamshale opened this issue 6 months ago • 12 comments

So, I'm quite pissed off at myself, but while looking at adding the distributed lock to the old credx issuance I noticed that the problem with the index wasn't reproducing.

After studying it further, I noticed that the entire read, index increment, and update was happening inside a single transaction, while the anoncreds implementation did the read and update in separate transactions. I didn't think doing this with Askar prevented another process from reading the rev_list at the same time, but apparently that is the case.
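
For illustration, here is a minimal sketch of that single-transaction pattern, assuming the aries-askar Python bindings (`Store.transaction`, `Session.fetch` with `for_update`, `Session.replace`) and a hypothetical `"rev_list"` record with a `"next_index"` field - this is not ACA-Py's actual revocation code:

```python
# Minimal sketch, not ACA-Py's actual implementation: claim the next
# revocation index with the read, increment, and update all inside one
# Askar transaction.
import json

from aries_askar import Store


async def claim_next_index(store: Store, rev_reg_def_id: str) -> int:
    async with store.transaction() as txn:
        # for_update=True keeps the fetched row locked until commit, so a
        # second process in its own transaction waits instead of reading
        # the same next_index.
        entry = await txn.fetch("rev_list", rev_reg_def_id, for_update=True)
        record = json.loads(entry.value)

        index = record["next_index"]
        record["next_index"] = index + 1

        await txn.replace("rev_list", rev_reg_def_id, value=json.dumps(record))
        await txn.commit()
    return index
```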

The hardest and most time-consuming part of this was creating the testing for multiple instances, which I think will still be valuable going forward for this project.

I'm not sure if the distributed locking mechanism will ever be needed. It wasn't too hard to implement and could potentially be useful in the future. It did solve this issue in another way, but I don't think it's required for this problem anymore...

jamshale avatar Jun 17 '25 23:06 jamshale

@ff137 -- what do you think? This would explain why this hadn't been reported before.

The question I have is whether or not we should include the Distributed Lock work that @jamshale worked on in the PR that he closed. It seems to me that it might be a valuable feature. On the other hand, it seems we haven't had a use case for it yet. Should we add it anyway?

Comments on this PR and the second question welcome from others -- @esune @thiagoromanos @andrewwhitehead @dbluhm @cjhowland @mepeltier

swcurran avatar Jun 17 '25 23:06 swcurran

I do think the distributed lock is good in some ways. I still don't know whether, under a ton of concurrent issuance requests, this is truly atomic or not. The scenario test can only exercise concurrency up to a limited load.

The distributed lock mechanism does add more complexity and the chance of stuck locks, etc, so it's better to avoid it if we don't need it.
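
To make the "stuck locks" concern concrete, here is a hypothetical sketch (not the code from the closed PR) of a distributed lock with a TTL and an owner token, assuming a Redis backend via redis-py's asyncio client:

```python
# Hypothetical sketch: a distributed lock needs a TTL so a crashed worker
# can't hold it forever, and an owner token so one worker can't release
# another worker's lock.
import uuid

from redis.asyncio import Redis

RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
end
return 0
"""


async def acquire_lock(redis: Redis, rev_reg_def_id: str, ttl_ms: int = 5000) -> str | None:
    token = uuid.uuid4().hex
    # NX: only set if the key is free; PX: auto-expire after ttl_ms so a
    # crashed process releases the lock automatically.
    if await redis.set(f"rev-list-lock:{rev_reg_def_id}", token, nx=True, px=ttl_ms):
        return token
    return None


async def release_lock(redis: Redis, rev_reg_def_id: str, token: str) -> None:
    # Compare-and-delete in Lua: only delete the lock if we still own it.
    await redis.eval(RELEASE_SCRIPT, 1, f"rev-list-lock:{rev_reg_def_id}", token)
```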

I don't think it was a complete waste of time if only to generate the test and think about concurrency in a clustered deployment more.

jamshale avatar Jun 17 '25 23:06 jamshale

I agree with @jamshale. Given the discovery, and the possibility that wrapping the required steps in a transaction will heavily mitigate or solve the issue, I would at least wait and reassess before we implement the additional locking mechanism - the added complexity might not be worth the benefit if it is very/too small.

esune avatar Jun 18 '25 14:06 esune

I think we should go ahead with this solution for now and get a release out.

From what I understand, Askar transactions aren't atomic across multiple instances and connection pools. So, while this greatly reduces the chance that two processes grab the same index at the same time, it doesn't prevent multiple instances from opening a transaction at the same time. At large scale this could still occur.

This is such a critical process because it prevents multiple credentials from having the same revocation index, and if it fails then the issuance is cancelled. Perhaps for a large-scale deployment a fully atomic lock is the correct solution.

Still, I think this solves the problem 99.9+% of the time.

jamshale avatar Jun 18 '25 15:06 jamshale

ChatGPT sums this up really well. https://chatgpt.com/s/t_6852e134862081918dd705fd92b46205

jamshale avatar Jun 18 '25 15:06 jamshale

Not really understanding the ChatGPT response in combination with the "this will be good enough for now" assessment. Let's talk about this at Tuesday's ACA-Pug meeting. Not to hold up this PR -- it's necessary and should go ahead, but rather to discuss the larger issue about why this is "good enough" vs. the heavyweight (but not really helpful?) distributed lock solution.

swcurran avatar Jun 18 '25 17:06 swcurran

Sounds good. I think it's more that this is a big improvement over what is currently basically a bug. However, the distributed lock is what is needed for a fully scalable, heavy-load deployment.

The distributed lock is a more complicated feature, so getting the bug fix into a release would probably be a good thing to do while we discuss and fully test the distributed lock, if we decide it is needed.

jamshale avatar Jun 18 '25 17:06 jamshale

Need to look at the failing scenario test. Thought it was working.

jamshale avatar Jun 23 '25 19:06 jamshale

By the way - just mentioning because I'm busy debugging some revocation stuff - it looks like cred_rev_id starts from 1, and this means that the revocation_list resources have an unused slot (the 0th index of the list).

i.e. after revoking some credentials, I see in the logs: revocation_list=[0, 1, 1, 1, ...], which makes it look like a rev reg of size n actually holds n-1 credentials, since the first slot is unused.

Not a big deal, but since we're looking at this code now, would it make sense to start counting from 0 instead of 1? Or is it intentional as it is?

ff137 avatar Jun 24 '25 13:06 ff137

This is because of the anoncreds revocation algorithm. The 0th index is reserved. It's caused a lot of confusion.

It's different from the legacy Indy implementation because it uses an actual list data type. I think we could change the block size to size + 1 so that it actually holds the same number of credentials as you think you are creating in a block.
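
As a quick illustration of the off-by-one, assuming the reserved 0th index described above (hypothetical helper, not existing code):

```python
def usable_slots(requested_size: int) -> int:
    # With index 0 reserved by the AnonCreds revocation algorithm, a
    # registry created with `requested_size` entries only holds
    # requested_size - 1 credentials (cred_rev_id runs from 1 to size - 1).
    return requested_size - 1


# The suggestion above: create the registry with size + 1 entries so the
# issuer actually gets `size` usable credential slots.
assert usable_slots(1000 + 1) == 1000
```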

jamshale avatar Jun 24 '25 14:06 jamshale

I'm having a lot of trouble with GitHub Actions and this scenario test. I'm going to open a separate PR with only the fix to move this forward while I figure out how to fix the test.

jamshale avatar Jun 24 '25 17:06 jamshale