aries-cloudagent-python

Losing messages in a clustered mediator

Open swcurran opened this issue 1 year ago • 0 comments

The issue is detailed in the OpenWallet JavaScript Agent repository, in this comment and in Issue 1625.

<tl;dr>:

In a clustered environment, a message arriving at one instance can be lost even when a shared queue is used (e.g. with the Redis/Kafka plugins). The proposed approach requires a 2-phase commit and specific Postgres features, notably Change Data Capture (CDC).

From the ticket, suggestions by @ericvergnaud about the relevant work in the OpenWallet JavaScript Agent that are also relevant for ACA-Py and its plugins:

I guess a useful thing that can be done:

  • provide a sample plugin that listens to Redis pub/sub
  • provide a sample pgsql CDC that writes to Redis pub/sub (since sqlite cannot act as a server, not sure it makes sense to provide a CDC sample for it)

Adapting these for Kafka and AWS EventBridge would also be useful.
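To make the shape of those two samples concrete, here is a minimal sketch of the CDC-to-pub/sub relay and the listening plugin. It is not ACA-Py or Redis API: `InMemoryPubSub`, `relay_change`, `QueueListenerPlugin`, the channel name and the table name are all hypothetical, and the in-memory bus only stands in for Redis pub/sub (or Kafka, or EventBridge) so that the example runs on its own.

```python
import json
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[str], None]


class InMemoryPubSub:
    """Hypothetical stand-in for a pub/sub broker such as Redis."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Handler) -> None:
        self._subscribers[channel].append(handler)

    def publish(self, channel: str, payload: str) -> int:
        # Fan the payload out to every subscriber and report how many got it.
        handlers = list(self._subscribers[channel])
        for handler in handlers:
            handler(payload)
        return len(handlers)


def relay_change(bus: InMemoryPubSub, channel: str, change: dict) -> int:
    """CDC side (assumed shape): turn one row-level change into a notification."""
    # Publish only the key; listeners are expected to re-read the row
    # from the shared database rather than trust the notification body.
    notification = {"table": change["table"], "op": change["op"], "id": change["id"]}
    return bus.publish(channel, json.dumps(notification))


class QueueListenerPlugin:
    """Plugin side (assumed shape): every instance hears every notification."""

    def __init__(self, instance_name: str, bus: InMemoryPubSub, channel: str) -> None:
        self.instance_name = instance_name
        self.seen: List[dict] = []
        bus.subscribe(channel, self._on_notification)

    def _on_notification(self, payload: str) -> None:
        self.seen.append(json.loads(payload))


# Usage: two instances subscribe, one insert is relayed to both.
bus = InMemoryPubSub()
plugin_a = QueueListenerPlugin("A", bus, "queued-messages")
plugin_b = QueueListenerPlugin("B", bus, "queued-messages")
delivered = relay_change(
    bus,
    "queued-messages",
    {"table": "queued_message", "op": "insert", "id": 42, "body": "ignored"},
)
```

Keeping the relay and the listener behind a small publish/subscribe interface is what would make the Kafka and EventBridge variants mostly a matter of swapping the bus.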

Another useful thing is updating the documentation for describing a 'supported' cluster setup.

For NR testing:

  • provide a sample http plugin that listens to http notifications
  • provide a sample event listener that invokes the above http endpoint
  • write a test that:
    • runs 2 mediators with the above plugins and a mediatee listening to instance A
    • sends a forward message M to instance B
    • ensures that the mediatee receives message M
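The test scenario above can be modelled in a few lines to show what it should prove. This is a simplified sketch under stated assumptions, not ACA-Py code: `SharedQueue`, `Notifier`, `Mediator` and `run_scenario` are hypothetical names, and the HTTP notification between instances is reduced to a direct function call.

```python
from typing import Callable, Dict, List, Optional


class SharedQueue:
    """Hypothetical shared store that both mediator instances write to."""

    def __init__(self) -> None:
        self.pending: Dict[str, List[str]] = {}

    def put(self, recipient: str, message: str) -> None:
        self.pending.setdefault(recipient, []).append(message)

    def take_all(self, recipient: str) -> List[str]:
        return self.pending.pop(recipient, [])


class Notifier:
    """Stand-in for the event listener that calls each instance's HTTP endpoint."""

    def __init__(self) -> None:
        self.listeners: List[Callable[[str], None]] = []

    def register(self, listener: Callable[[str], None]) -> None:
        self.listeners.append(listener)

    def notify(self, recipient: str) -> None:
        for listener in self.listeners:
            listener(recipient)


class Mediator:
    def __init__(self, name: str, queue: SharedQueue, notifier: Optional[Notifier]) -> None:
        self.name = name
        self.queue = queue
        self.notifier = notifier
        self.sessions: Dict[str, List[str]] = {}  # recipient -> live inbox
        if notifier is not None:
            notifier.register(self.on_notification)

    def connect(self, recipient: str, inbox: List[str]) -> None:
        self.sessions[recipient] = inbox

    def receive_forward(self, recipient: str, message: str) -> None:
        self.queue.put(recipient, message)
        if self.notifier is not None:
            self.notifier.notify(recipient)  # tell every instance
        else:
            self.on_notification(recipient)  # only this instance reacts

    def on_notification(self, recipient: str) -> None:
        inbox = self.sessions.get(recipient)
        if inbox is None:
            return  # no live session here; another instance must deliver
        inbox.extend(self.queue.take_all(recipient))


def run_scenario(with_notifications: bool) -> List[str]:
    queue = SharedQueue()
    notifier = Notifier() if with_notifications else None
    instance_a = Mediator("A", queue, notifier)
    instance_b = Mediator("B", queue, notifier)
    inbox: List[str] = []
    instance_a.connect("mediatee", inbox)         # mediatee listens on A
    instance_b.receive_forward("mediatee", "M")   # forward arrives at B
    return inbox
```

In this model `run_scenario(True)` delivers `M` to the mediatee, while `run_scenario(False)` leaves it sitting in the shared queue: the queue alone is not enough unless the instance holding the live session is told that something arrived.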

swcurran, Jan 02 '24 17:01