electrum icon indicating copy to clipboard operation
electrum copied to clipboard

Timeout error shadowed by aiorpcx cancellation bug (interface/synchronizer, wallets stuck)

Open MrNaif2018 opened this issue 1 year ago • 8 comments

There is a problem in interface's synchronizer: if we connect to a new server, we first subscribe for notifications using blockchain.scripthash.subscribe, but if server is too slow to respond (longer than default timeout of 30 seconds), then aiorpcx's timeout_after raises it's exception which inherits from CancelledError, which apparently causes async task to be not only not handled, but also cancelled and disappear, which makes synchronizer's self.requested_addrs have some requested addresses in it, but never emptied, causing wallet.is_up_to_date() return False, which will stick until interface is changed (by luck after a few hours or in case of exception which may never be raised)

Reproducer:

diff --git a/aiorpcx/session.py b/aiorpcx/session.py
index efff404..ca96560 100755
--- a/aiorpcx/session.py
+++ b/aiorpcx/session.py
@@ -509,6 +509,8 @@ class RPCSession(SessionBase):
             send_time = await self._send_message(message)
             try:
                 async with timeout_after(self.sent_request_timeout):
+                    if b"blockchain.scripthash.subscribe" in message:
+                        await asyncio.sleep(40)
                     return await future
             finally:
                 time_taken = max(0, time.time() - send_time)

Similar issue we had before: 12e628e2e6eb39709361b587a87e35f291200d16

MrNaif2018 avatar Mar 14 '24 18:03 MrNaif2018

This should be fixed in aiorpcx. There is more discussion in https://github.com/kyuupichan/aiorpcX/pull/48 I think https://github.com/kyuupichan/aiorpcX/commit/e1c7a227f8886972c6d18f920cd5c39462e055d3 is sufficient.

(note: I was playing around a bit in https://github.com/spesmilo/electrum/compare/master...SomberNight:electrum:202403_asyncio_waitfor, e.g. see https://github.com/sombernight/electrum/commit/b388ec3d8caddcc66d69bcd32f6a59ebacf21b16)

SomberNight avatar Mar 15 '24 15:03 SomberNight

By the way there are a few more timeout-related exceptions in aiorpcx which still inherit from CancelledError, they won't cause problems?

MrNaif2018 avatar Mar 15 '24 15:03 MrNaif2018

There is only TimeoutCancellationError, isn't there? I have tried and haven't managed to write a unit test that involved triggering something like this for that. Feel free to try.

SomberNight avatar Mar 15 '24 15:03 SomberNight

It looks like wallet is still not synchronized due to synchronizer.is_up_to_date() being False. It's False once again due to self.requested_addrs being non-empty. So I think what happens is: we now get a timeout, but those addresses which were already requested but failed with a timeout are still left in self.requested_addrs, and for some reason it never gets cleared. I think the Synchronizer task is working, but it's state wasn't cleared during interface change when taskgroup died

MrNaif2018 avatar Mar 20 '24 17:03 MrNaif2018

So I found out this issue is not reproducible on python 3.12, so probably it's still some asyncio bug involved when taskgroup dies still triggering similar issues Will test on 3.11 now to see if it works in 3.11 too

MrNaif2018 avatar Mar 27 '24 11:03 MrNaif2018

It works in 3.11 too. So indeed something similar is involved even after taskgroup has died

MrNaif2018 avatar Mar 28 '24 17:03 MrNaif2018

To be clear, aiorpcx 0.23.1 is sufficient to fix the reproducer from the OP, right? But you are saying your problem is not sufficiently covered by that reproducer?

SomberNight avatar Apr 05 '24 17:04 SomberNight

aiorpcx is sufficient to fix initial reproducer, yes I think Exception is logged correctly, but when taskgroup dies, some logic elsewhere makes it so that the wallet which was behind the died taskgroup does not recover. All other wallets who just receive stop signal recover just fine

MrNaif2018 avatar Apr 06 '24 22:04 MrNaif2018

Well, the example in the OP has been fixed, so I am inclined to close this. You are very welcome to open a new issue if you can somewhat detail the remaining bug or ideally create a new reproducer for it.

SomberNight avatar May 27 '24 15:05 SomberNight