aries-cloudagent-python
Losing messages in a clustered mediator
The issue is detailed for the OpenWallet JavaScript Agent in this comment and in Issue 1625.
<tl;dr>:
In a clustered environment, a message arriving at one instance can be lost even when a shared queue is used (e.g. with the Redis or Kafka plugins). The proposed approach requires a two-phase commit and specific PostgreSQL features, notably Change Data Capture (CDC).
From the ticket, a suggestion by @ericvergnaud about the relevant work in the OpenWallet JavaScript Agent, which is also relevant for ACA-Py and its plugins:
I guess a useful thing that can be done:
- provide a sample plugin that listens to Redis pub/sub
- provide a sample pgsql CDC that writes to Redis pub/sub (since sqlite cannot act as a server, not sure it makes sense to provide a CDC sample for it)
Adapting these for Kafka and AWS EventBridge would also be useful.
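To make the two suggested samples more concrete, here is a minimal in-memory sketch of the data flow, assuming the CDC side publishes one notification per inserted queue row and each agent instance subscribes to the same channel. `InMemoryPubSub` and `QueueTable` are hypothetical stand-ins for Redis pub/sub and a PostgreSQL table with CDC, and the channel name is invented for illustration. None of this is ACA-Py or Redis API.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class InMemoryPubSub:
    """Hypothetical stand-in for Redis pub/sub: fan out to every subscriber."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Handler) -> None:
        self._subscribers[channel].append(handler)

    def publish(self, channel: str, payload: dict) -> int:
        handlers = list(self._subscribers[channel])
        for handler in handlers:
            handler(payload)
        return len(handlers)  # number of subscribers that were notified


class QueueTable:
    """Hypothetical stand-in for a PostgreSQL queue table observed via CDC."""

    def __init__(self) -> None:
        self.rows: List[dict] = []
        self._cdc_listeners: List[Handler] = []

    def on_insert(self, listener: Handler) -> None:
        self._cdc_listeners.append(listener)

    def insert(self, row: dict) -> None:
        self.rows.append(row)
        # Emulates CDC: every inserted row is reported to the listeners.
        for listener in self._cdc_listeners:
            listener(row)


CHANNEL = "mediator:queued-messages"  # assumed channel name

bus = InMemoryPubSub()
table = QueueTable()

# CDC side: turn each inserted row into a pub/sub notification.
table.on_insert(
    lambda row: bus.publish(CHANNEL, {"recipient": row["recipient"], "id": row["id"]})
)

# Plugin side: an agent instance subscribes and records notifications.
received: List[dict] = []
bus.subscribe(CHANNEL, received.append)

table.insert({"id": 1, "recipient": "mediatee-1", "payload": "hello"})
print(received)
```

In a real deployment the notification would only carry enough information (recipient and message id here) for the instance holding the recipient's connection to fetch and deliver the message from the shared store.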
Another useful thing is updating the documentation for describing a 'supported' cluster setup.
For NR testing:
- provide a sample http plugin that listens to http notifications
- provide a sample event listener that invokes the above http endpoint
- write a test that:
  - runs 2 mediators with the above plugins and a mediatee listening to instance A
  - sends a forward message M to instance B
  - ensures that the mediatee receives message M
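The test scenario could be prototyped along these lines before wiring up real agents. This is a sketch only: `SharedQueue`, `Mediator`, and `Mediatee` are hypothetical in-memory classes, and direct callbacks replace the HTTP notification plugin and event listener described above. The `listen=False` run illustrates, under these assumptions, how the message stays undelivered when instances are not notified.

```python
from typing import Callable, Dict, List


class SharedQueue:
    """Hypothetical shared queue that notifies listeners on every enqueue."""

    def __init__(self) -> None:
        self.pending: Dict[str, List[str]] = {}
        self.listeners: List[Callable[[str], None]] = []

    def enqueue(self, recipient: str, message: str) -> None:
        self.pending.setdefault(recipient, []).append(message)
        for notify in list(self.listeners):
            notify(recipient)

    def take_all(self, recipient: str) -> List[str]:
        return self.pending.pop(recipient, [])


class Mediatee:
    def __init__(self, name: str) -> None:
        self.name = name
        self.inbox: List[str] = []


class Mediator:
    """Hypothetical mediator instance; `listen` toggles the notification plugin."""

    def __init__(self, name: str, queue: SharedQueue, listen: bool = True) -> None:
        self.name = name
        self.queue = queue
        self.sessions: Dict[str, Mediatee] = {}
        if listen:
            queue.listeners.append(self.on_queued)

    def connect(self, mediatee: Mediatee) -> None:
        self.sessions[mediatee.name] = mediatee

    def receive_forward(self, recipient: str, message: str) -> None:
        # The receiving instance only stores the message in the shared queue.
        self.queue.enqueue(recipient, message)

    def on_queued(self, recipient: str) -> None:
        # Only the instance holding the recipient's session delivers.
        session = self.sessions.get(recipient)
        if session is not None:
            session.inbox.extend(self.queue.take_all(recipient))


def run_scenario(listen: bool) -> List[str]:
    queue = SharedQueue()
    instance_a = Mediator("A", queue, listen=listen)
    instance_b = Mediator("B", queue, listen=listen)
    mediatee = Mediatee("mediatee-1")
    instance_a.connect(mediatee)                    # mediatee listens to instance A
    instance_b.receive_forward("mediatee-1", "M")   # forward message M sent to B
    return mediatee.inbox


print(run_scenario(listen=True))
print(run_scenario(listen=False))
```

A real test would replace the in-memory classes with two running mediator processes and assert on what the mediatee actually receives over its connection to instance A.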