aries-cloudagent-python icon indicating copy to clipboard operation
aries-cloudagent-python copied to clipboard

`Kanon Profile:` aiohttp request timeouts.

Open jamshale opened this issue 1 month ago • 27 comments

When creating scenario tests for the kanon profile I ran into an intermittent issue. The test is disabled here https://github.com/openwallet-foundation/acapy/blob/main/scenarios/examples/kanon_issuance_and_presentation/example.py. What it does is create a kanon issuer, a connection with an askar holder, and issue, verify presentation and revoke a credential. Usually it works locally but sometimes it gets aiohttp timeouts when performing an http request. The problem happens more often when it's run with github actions or a low resource docker container.

The scenario tests quickly create storage records and http requests. Much different then manually creating records through the swagger api. I think there's some type of race condition or blocking issue with the storage/wallet transactions.

This same test passes consistently with askar storage.

Steps to reproduce:

  • Build a local docker image from root directory. docker build -t acapy-test -f ./docker/Dockerfile.run .
  • Uncomment the test in scenarios/examples/kanon_issuance_and_presentation/example.py
  • Run the test. cd scenarios/examples/kanon_issuance_and_presentation and docker compose up --build
  • The test should hang and timeout if you run it enough times. However, sometimes it would pass many times in a row for me.

Here is a fail in github actions https://github.com/openwallet-foundation/acapy/actions/runs/19216759154/job/54927364677.

Note the hang and then aiohttp timeout failure.

Traceback (most recent call last):
  File "/usr/src/app/.venv/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 532, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
  File "/usr/src/app/.venv/lib/python3.10/site-packages/aiohttp/streams.py", line 672, in read
    await self._waiter
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/src/app/example.py", line 171, in <module>
    asyncio.run(main())
  File "/usr/local/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/usr/src/app/example.py", line 84, in main
    await kanon_postgres.post(
  File "/usr/src/app/.venv/lib/python3.10/site-packages/acapy_controller/controller.py", line 491, in post
    return await self.request(
  File "/usr/src/app/.venv/lib/python3.10/site-packages/acapy_controller/controller.py", line 351, in request
    async with session.request(
  File "/usr/src/app/.venv/lib/python3.10/site-packages/aiohttp/client.py", line 1488, in __aenter__
    self._resp: _RetType = await self._coro
  File "/usr/src/app/.venv/lib/python3.10/site-packages/aiohttp/client.py", line 770, in _request
    resp = await handler(req)
  File "/usr/src/app/.venv/lib/python3.10/site-packages/aiohttp/client.py", line 748, in _connect_and_send_request
    await resp.start(conn)
  File "/usr/src/app/.venv/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 527, in start
    with self._timer:
  File "/usr/src/app/.venv/lib/python3.10/site-packages/aiohttp/helpers.py", line 685, in __exit__
    raise asyncio.TimeoutError from exc_val
asyncio.exceptions.TimeoutError

I'll see if I can get it to fail with debug logs enabled and get any further information.

I believe this problem could reoccur in production if a controller was quickly creating records and using the admin api.

jamshale avatar Nov 10 '25 18:11 jamshale

@vinaysingh8866 FYI. When trying to create integration tests for the kanon profile I ran into this issue.

jamshale avatar Nov 10 '25 18:11 jamshale

@jamshale I have made a PR to fix this, i have uncommented the code in examples to test this. Removed the try catch from handlers, finally block was also committing the transactions. https://github.com/openwallet-foundation/acapy/pull/3950

vinaysingh8866 avatar Nov 13 '25 23:11 vinaysingh8866

Awesome work @vinaysingh8866 -- thank you. I'll update the 1.4.0 change log tomorrow to add the PR and remove the warning.

Good stuff!!

swcurran avatar Nov 14 '25 00:11 swcurran

@vinaysingh8866 Unfortunately after updating main with this fix the scenario tests started failing again, even though they passed multiple times from the PR. This was the intermittent behavior I was seeing before.

It really seemed like you were on to something with the transaction management but the problem does still persist :/ Not sure if you have anymore ideas?

jamshale avatar Nov 14 '25 16:11 jamshale

@jamshale the timeout should be resolved now there was another call that could be causing issue. If it doesn't we need to recreate it in a vm. I'm not able to recreate it on my device now. https://github.com/openwallet-foundation/acapy/pull/3951

vinaysingh8866 avatar Nov 14 '25 22:11 vinaysingh8866

Looks like this is still an intermittent issue. I'm going to need to disable the test again...

The test is definitely passing more often then it did before the latest PR got merged so it does seem like the changes are an improvement. Maybe there is still a related problem somewhere else with the transaction management?

jamshale avatar Nov 18 '25 16:11 jamshale

@jamshale the only remaining deadlocks could be coming from the sessions code will check it out and see if something can be made better. We need someway to recreate this and confirm where the deadlock is happening in single threaded environments.

vinaysingh8866 avatar Nov 18 '25 18:11 vinaysingh8866

I think I may be able to change the scenario docker compose values to limit the agent to a single thread env. I'll try and have a look and see if I can get it to fail more consistently.

jamshale avatar Nov 18 '25 18:11 jamshale

@jamshale i have opened another PR for fix, can you try on this if you're able to recreate the error https://github.com/openwallet-foundation/acapy/pull/3958

vinaysingh8866 avatar Nov 21 '25 14:11 vinaysingh8866

Sounds good 👍 I'll try and test the PR out before approving it.

jamshale avatar Nov 21 '25 16:11 jamshale

It's weird. The nightly runs of the scenario tests were failing over the weekend because of the same timeout issue. PR runs have been passing today though.

I'm really unsure what is different about github actions manifesting the problem super intermittently. I tried running it with my forked repos github action runner many times and the timeout wasn't happening.

I'll try and think of what to do but it seems like the problem remains :/ I'm not sure if it could manifest in a real deployment, but it's still somewhat concerning behavior.

jamshale avatar Nov 24 '25 16:11 jamshale

The code spawns multiple operations to a db for connection cleanup and the containers run on shared instances the performance is not always same some of the them get locked due to only 1 thread. Let me see if how operations can share single thread and happen sequentially.

vinaysingh8866 avatar Nov 24 '25 19:11 vinaysingh8866

@jamshale I have done refactor of the wallet to pass a session so nested calls can reuse the connections. We should run the test actions on this pr like 10 times to sample if the issue is there before merging. https://github.com/openwallet-foundation/acapy/pull/3963

vinaysingh8866 avatar Nov 26 '25 06:11 vinaysingh8866

😢 https://github.com/openwallet-foundation/acapy/actions/runs/19694357044/job/56475672357

Failed again on the sixth run...

jamshale avatar Nov 26 '25 17:11 jamshale

@jamshale i have made changes can you try running the tests again and try to see if it fails

vinaysingh8866 avatar Nov 26 '25 18:11 vinaysingh8866

https://github.com/openwallet-foundation/acapy/actions/runs/19713489277/job/56483331632?pr=3963 Still failing.

jamshale avatar Nov 26 '25 19:11 jamshale

I have refactored the wallet more, now profile and wallet would use the same connections. Lets test again @jamshale 🤞

vinaysingh8866 avatar Nov 26 '25 20:11 vinaysingh8866

it failed again we're missing something 🤔

vinaysingh8866 avatar Nov 26 '25 20:11 vinaysingh8866

https://github.com/openwallet-foundation/acapy/actions/runs/19716348019/job/56490968052?pr=3963 :/

jamshale avatar Nov 26 '25 20:11 jamshale

I'll see if there is another place that is creating more connections lets give it one more try today to fix @jamshale

vinaysingh8866 avatar Nov 26 '25 21:11 vinaysingh8866

@jamshale can you run the tests again

vinaysingh8866 avatar Nov 26 '25 21:11 vinaysingh8866

Still failing... Must be somewhere else creating connections. I'll try and find some time to see if I can spot anything.

jamshale avatar Nov 26 '25 22:11 jamshale

I have added explicit closing code for the scans can you test it and see if it helps.

vinaysingh8866 avatar Nov 27 '25 06:11 vinaysingh8866

@swcurran I have tried various things to fix this, its seems like a issue with how kanon uses both askar and postgres connections it might not work reliably on single thread. Someone with some devops background should look at this and see the postgres logs on the timeout what state its at on failure.

vinaysingh8866 avatar Nov 27 '25 21:11 vinaysingh8866

Can you post the most recent logs here that provide the relevant information. I'll see if I can find someone to look into this. Thanks.

swcurran avatar Nov 27 '25 21:11 swcurran

Thats the issue we're not able to recreate this locally and the actions are not showing logs of the postgres and the failure just shows as timeout. I have added more logging in the latest commits and an async context manager for the connections. Its not failing for me now running it more times is needed to check the current state i don't have access to do that.

vinaysingh8866 avatar Nov 28 '25 00:11 vinaysingh8866

When I run the test on the latest PR locally (with Docker on a Mac), the verifier is consistently failing when verifying the first proof (line 144 of example.py). The controller then exits with a code 1. I'm not sure if that would lead to the timeout error seen on the CI, as unfortunately it doesn't seem to capture the controller's logs there.

If I disable revocation support (adjust the credential definition creation and remove the revoke operation) then the controller completes with a code 0, although the other containers keep running. That seems to be normal for local operation?

My theory is that the backend is not actually waiting for the revocation setup to be completed, although "wait_for_revocation_setup": true is passed, and so the verifier cannot verify the non-revocation, but I haven't confirmed that yet.

andrewwhitehead avatar Dec 05 '25 18:12 andrewwhitehead

@vinaysingh8866 The scenario tests have been changed to log the docker output. It is quite a lot, you need to sift through it to find the kanon test.

With the new configurations the agent is getting a different error now on startup. I think it has something to do with the normaliztion step.

2025-12-17T00:42:14.9033375Z kanon-postgres-1  | 2025-12-17 00:42:14,244 acapy_agent.database_manager.databases.postgresql_normalized.handlers.generic_handler DEBUG [fetch] Starting with profile_id=1, category=config, name=default_public_did, tag_filter=None, for_update=False, tags_table=postgres.items_tags
2025-12-17T00:42:14.9036065Z kanon-postgres-1  | 2025-12-17 00:42:14,245 acapy_agent.database_manager.databases.postgresql_normalized.session.DBStore ERROR Failed to fetch item 'default_public_did' in category 'config': relation "postgres.items" does not exist
2025-12-17T00:42:14.9037655Z kanon-postgres-1  | LINE 2:             SELECT id, value FROM postgres.items
2025-12-17T00:42:14.9038338Z kanon-postgres-1  |                                           ^
2025-12-17T00:42:14.9039608Z kanon-postgres-1  | 2025-12-17 00:42:14,245 acapy_agent.database_manager.dbstore ERROR fetch error: Failed to fetch item 'default_public_did' in category 'config'
2025-12-17T00:42:14.9042088Z kanon-postgres-1  | 2025-12-17 00:42:14,245 acapy_agent.database_manager.databases.postgresql_normalized.backend DEBUG Translating error: Failed to fetch item 'default_public_did' in category 'config', type=<class 'acapy_agent.database_manager.databases.errors.DatabaseError'>
2025-12-17T00:42:14.9043991Z kanon-postgres-1  | 2025-12-17 00:42:14,245 acapy_agent.commands.start INFO Shutting down
2025-12-17T00:42:14.9044938Z kanon-postgres-1  | 2025-12-17 00:42:14,245 acapy_agent.core.conductor INFO Stopping the Conductor agent.
2025-12-17T00:42:14.9046138Z kanon-postgres-1  | 2025-12-17 00:42:14,245 acapy_agent.core.conductor DEBUG Waiting for shutdown tasks to complete with timeout=30.000000.
2025-12-17T00:42:14.9047375Z kanon-postgres-1  | 2025-12-17 00:42:14,245 acapy_agent.core.conductor INFO Conductor agent stopped successfully.
2025-12-17T00:42:14.9048489Z kanon-postgres-1  | 2025-12-17 00:42:14,245 acapy_agent.commands.start ERROR Unexpected exception during execution
2025-12-17T00:42:14.9049540Z kanon-postgres-1  | Traceback (most recent call last):
2025-12-17T00:42:14.9050765Z kanon-postgres-1  |   File "/usr/src/app/acapy_agent/database_manager/databases/postgresql_normalized/session.py", line 381, in fetch
2025-12-17T00:42:14.9051815Z kanon-postgres-1  |     result = await handler.fetch(
2025-12-17T00:42:14.9052421Z kanon-postgres-1  |              ^^^^^^^^^^^^^^^^^^^^
2025-12-17T00:42:14.9053534Z kanon-postgres-1  |   File "/usr/src/app/acapy_agent/database_manager/databases/postgresql_normalized/handlers/generic_handler.py", line 320, in fetch
2025-12-17T00:42:14.9054700Z kanon-postgres-1  |     await cursor.execute(query, params)
2025-12-17T00:42:14.9056089Z kanon-postgres-1  |   File "/root/.cache/pypoetry/virtualenvs/acapy-agent-VA82Wl8V-py3.12/lib/python3.12/site-packages/psycopg/cursor_async.py", line 117, in execute
2025-12-17T00:42:14.9057361Z kanon-postgres-1  |     raise ex.with_traceback(None)
2025-12-17T00:42:14.9058167Z kanon-postgres-1  | psycopg.errors.UndefinedTable: relation "postgres.items" does not exist
2025-12-17T00:42:14.9059252Z kanon-postgres-1  | LINE 2:             SELECT id, value FROM postgres.items
2025-12-17T00:42:14.9059872Z kanon-postgres-1  |                                           ^
2025-12-17T00:42:14.9060341Z kanon-postgres-1  | 
2025-12-17T00:42:14.9060941Z kanon-postgres-1  | During handling of the above exception, another exception occurred:
2025-12-17T00:42:14.9061609Z kanon-postgres-1  | 
2025-12-17T00:42:14.9062038Z kanon-postgres-1  | Traceback (most recent call last):
2025-12-17T00:42:14.9062852Z kanon-postgres-1  |   File "/usr/src/app/acapy_agent/database_manager/dbstore.py", line 637, in fetch
2025-12-17T00:42:14.9063693Z kanon-postgres-1  |     return await self._db_session.fetch(
2025-12-17T00:42:14.9064284Z kanon-postgres-1  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-12-17T00:42:14.9065299Z kanon-postgres-1  |   File "/usr/src/app/acapy_agent/database_manager/databases/postgresql_normalized/session.py", line 410, in fetch
2025-12-17T00:42:14.9066305Z kanon-postgres-1  |     raise DatabaseError(
2025-12-17T00:42:14.9067360Z kanon-postgres-1  | acapy_agent.database_manager.databases.errors.DatabaseError: Failed to fetch item 'default_public_did' in category 'config'
2025-12-17T00:42:14.9068454Z kanon-postgres-1  | 
2025-12-17T00:42:14.9069204Z kanon-postgres-1  | During handling of the above exception, another exception occurred:
2025-12-17T00:42:14.9069895Z kanon-postgres-1  | 
2025-12-17T00:42:14.9070321Z kanon-postgres-1  | Traceback (most recent call last):
2025-12-17T00:42:14.9071162Z kanon-postgres-1  |   File "/usr/src/app/acapy_agent/storage/kanon_storage.py", line 89, in _get_record
2025-12-17T00:42:14.9071945Z kanon-postgres-1  |     item = await self._call_handle_or_session(
2025-12-17T00:42:14.9072562Z kanon-postgres-1  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-12-17T00:42:14.9073343Z kanon-postgres-1  |   File "/usr/src/app/acapy_agent/storage/kanon_storage.py", line 406, in _call_handle_or_session
2025-12-17T00:42:14.9074289Z kanon-postgres-1  |     return await smethod(*args, **kwargs)
2025-12-17T00:42:14.9074991Z kanon-postgres-1  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-12-17T00:42:14.9075732Z kanon-postgres-1  |   File "/usr/src/app/acapy_agent/database_manager/dbstore.py", line 644, in fetch
2025-12-17T00:42:14.9076555Z kanon-postgres-1  |     raise self._db_session.translate_error(e)
2025-12-17T00:42:14.9077363Z kanon-postgres-1  | acapy_agent.database_manager.error.DBStoreError: Query error
2025-12-17T00:42:14.9078103Z kanon-postgres-1  | 
2025-12-17T00:42:14.9078710Z kanon-postgres-1  | The above exception was the direct cause of the following exception:
2025-12-17T00:42:14.9079680Z kanon-postgres-1  | 
2025-12-17T00:42:14.9080121Z kanon-postgres-1  | Traceback (most recent call last):
2025-12-17T00:42:14.9080895Z kanon-postgres-1  |   File "/usr/src/app/acapy_agent/commands/start.py", line 129, in execute
2025-12-17T00:42:14.9081667Z kanon-postgres-1  |     asyncio.run(run_app(argv))
2025-12-17T00:42:14.9082959Z kanon-postgres-1  |   File "/root/.cache/pypoetry/virtualenvs/acapy-agent-VA82Wl8V-py3.12/lib/python3.12/site-packages/nest_asyncio.py", line 30, in run
2025-12-17T00:42:14.9084172Z kanon-postgres-1  |     return loop.run_until_complete(task)
2025-12-17T00:42:14.9084797Z kanon-postgres-1  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-12-17T00:42:14.9086015Z kanon-postgres-1  |   File "/root/.cache/pypoetry/virtualenvs/acapy-agent-VA82Wl8V-py3.12/lib/python3.12/site-packages/nest_asyncio.py", line 98, in run_until_complete
2025-12-17T00:42:14.9087214Z kanon-postgres-1  |     return f.result()
2025-12-17T00:42:14.9087683Z kanon-postgres-1  |            ^^^^^^^^^^
2025-12-17T00:42:14.9088376Z kanon-postgres-1  |   File "/usr/local/lib/python3.12/asyncio/futures.py", line 202, in result
2025-12-17T00:42:14.9089601Z kanon-postgres-1  |     raise self._exception.with_traceback(self._exception_tb)
2025-12-17T00:42:14.9090668Z kanon-postgres-1  |   File "/usr/local/lib/python3.12/asyncio/tasks.py", line 314, in __step_run_and_handle_result
2025-12-17T00:42:14.9091568Z kanon-postgres-1  |     result = coro.send(None)
2025-12-17T00:42:14.9092088Z kanon-postgres-1  |              ^^^^^^^^^^^^^^^
2025-12-17T00:42:14.9092819Z kanon-postgres-1  |   File "/usr/src/app/acapy_agent/commands/start.py", line 120, in run_app
2025-12-17T00:42:14.9093595Z kanon-postgres-1  |     await start_app(conductor)
2025-12-17T00:42:14.9094355Z kanon-postgres-1  |   File "/usr/src/app/acapy_agent/commands/start.py", line 31, in start_app
2025-12-17T00:42:14.9095118Z kanon-postgres-1  |     await conductor.setup()
2025-12-17T00:42:14.9095865Z kanon-postgres-1  |   File "/usr/src/app/acapy_agent/core/conductor.py", line 163, in setup
2025-12-17T00:42:14.9096847Z kanon-postgres-1  |     self.root_profile, self.setup_public_did = await wallet_config(context)
2025-12-17T00:42:14.9097694Z kanon-postgres-1  |                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-12-17T00:42:14.9098548Z kanon-postgres-1  |   File "/usr/src/app/acapy_agent/config/wallet.py", line 196, in wallet_config
2025-12-17T00:42:14.9099656Z kanon-postgres-1  |     public_did_info = await wallet.get_public_did()
2025-12-17T00:42:14.9100360Z kanon-postgres-1  |                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-12-17T00:42:14.9113446Z kanon-postgres-1  |   File "/usr/src/app/acapy_agent/wallet/kanon_wallet.py", line 664, in get_public_did
2025-12-17T00:42:14.9114515Z kanon-postgres-1  |     return await self._get_public_did_impl(session)
2025-12-17T00:42:14.9115243Z kanon-postgres-1  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-12-17T00:42:14.9116178Z kanon-postgres-1  |   File "/usr/src/app/acapy_agent/wallet/kanon_wallet.py", line 674, in _get_public_did_impl
2025-12-17T00:42:14.9117181Z kanon-postgres-1  |     public_item = await storage.get_record(
2025-12-17T00:42:14.9117850Z kanon-postgres-1  |                   ^^^^^^^^^^^^^^^^^^^^^^^^^
2025-12-17T00:42:14.9118887Z kanon-postgres-1  |   File "/usr/src/app/acapy_agent/storage/kanon_storage.py", line 83, in get_record
2025-12-17T00:42:14.9120301Z kanon-postgres-1  |     return await self._get_record(record_type, record_id, for_update, session)
2025-12-17T00:42:14.9121120Z kanon-postgres-1  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-12-17T00:42:14.9121963Z kanon-postgres-1  |   File "/usr/src/app/acapy_agent/storage/kanon_storage.py", line 93, in _get_record
2025-12-17T00:42:14.9122952Z kanon-postgres-1  |     raise StorageError("Error when fetching storage record") from err
2025-12-17T00:42:14.9123926Z kanon-postgres-1  | acapy_agent.storage.error.StorageError: Error when fetching storage record
2025-12-17T00:42:14.9125178Z kanon-postgres-1  | 2025-12-17 00:42:14,249 aries_askar.native.aries_askar.ffi.store DEBUG 	src/ffi/store.rs:1211 | Closed session SessionHandle(1)
2025-12-17T00:42:14.9126531Z kanon-postgres-1  | 2025-12-17 00:42:14,249 aries_askar.native.aries_askar.ffi.log DEBUG 	src/ffi/log.rs:130 | Removing custom logger
2025-12-17T00:42:14.9127893Z kanon-postgres-1  | 2025-12-17 00:42:14,249 aries_askar.native.aries_askar.ffi.store DEBUG 	src/ffi/store.rs:603 | Closed store StoreHandle(1)
2025-12-17T00:42:14.9129749Z kanon-postgres-1  | 2025-12-17 00:42:14,251 aries_askar.native.askar_storage.backend.db_utils DEBUG 	askar-storage/src/backend/db_utils.rs:177 | Dropped transaction: roll-back
2025-12-17T00:42:14.9133483Z kanon-postgres-1  | KanonProfileManager Provision store with config: {'key': 'insecure', 'name': 'kanon-postgres-normalized', 'storage_config': '{"url":"wallet-db:5432","max_connections":100,"min_idle_count":5,"max_idle":10.0,"max_lifetime":7200.0}', 'storage_creds': '{"account":"myuser","password":"mypass"}', 'storage_type': 'postgres_storage', 'dbstore_storage_type': 'postgres_storage', 'dbstore_storage_config': '{"url":"wallet-db:5432","connection_timeout":30.0,"max_connections":100,"min_idle_count":5,"max_idle":10.0,"max_lifetime":7200.0}', 'dbstore_storage_creds': '{"account":"myuser","password":"mypass"}', 'dbstore_schema_config': 'normalize', 'auto_recreate': True}

jamshale avatar Dec 17 '25 00:12 jamshale