Mediation issues with multiple agents attempting to connect concurrently

Open niall-shaw opened this issue 3 years ago • 2 comments

We are encountering multiple issues, with no single identifiable cause, when up to 30 client agents attempt to initiate mediation with a single mediator agent simultaneously. The mediator agent uses MySQL for storage, via the vdr-tools binary (a fork of indy-sdk).

Some of the errors that have been occurring include the following:

  • SIGSEGV (segmentation fault)
  • Heap corruption
  • DEBUG: Request was aborted due to timeout. Not throwing error due to return routing on sent message

We believe one cause is that MySQL allows queries to execute considerably more concurrently than SQLite. When checks like the one below run in parallel, each caller can find that the singleton record does not exist yet and attempt to create it, so the record is created numerous times, resulting in failures.

if (!this._mediatorRoutingRecord) {
    this.agentConfig.logger.debug('Mediator routing record not loaded yet, retrieving from storage');
    let routingRecord = await this.mediatorRoutingRepository.findById(this.mediatorRoutingRepository.MEDIATOR_ROUTING_RECORD_ID);
    // If we don't have a routing record yet, create it
    if (!routingRecord) {
        this.agentConfig.logger.debug('Mediator routing record does not exist yet, creating routing keys and record');
        const { verkey } = await this.wallet.createDid();
        routingRecord = new repository_1.MediatorRoutingRecord({
            id: this.mediatorRoutingRepository.MEDIATOR_ROUTING_RECORD_ID,
            routingKeys: [verkey],
        });
        await this.mediatorRoutingRepository.save(routingRecord);
    }
    this._mediatorRoutingRecord = routingRecord;
}

This is one instance that I managed to identify. As a temporary fix, I delay every call of the function after the first by 20 ms, which gives the first save query time to finish executing; see below.

const thisQuery = ++this._totalWaitingQueries;
this.agentConfig.logger.debug('Retrieving mediator routing keys');
// Temporary workaround: every call after the first waits 20 ms so that the first save can finish
if (thisQuery !== 1) {
    await new Promise((resolve) => setTimeout(resolve, 20));
}
// If the routing record is not loaded yet, retrieve it from storage

However, this temporary fix is not optimal, as we should not add an arbitrary delay to all further operations.
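One way to avoid the fixed delay, sketched here under assumptions rather than taken from the framework, is to cache the in-flight promise so that concurrent callers within the same process share a single load-or-create operation. All names below (`InMemoryRoutingRepository`, `RoutingService`, `getRoutingRecord`, `loadOrCreate`, `simulateConcurrentClients`, the placeholder verkey) are hypothetical stand-ins, and the repository is an in-memory fake with an artificial delay, not the real storage layer.

```javascript
// Hypothetical in-memory stand-in for the routing repository. The timers
// only mimic the latency of an asynchronous database query.
class InMemoryRoutingRepository {
    constructor() {
        this.record = null;
        this.saveCount = 0;
    }

    async findById(id) {
        await new Promise((resolve) => setTimeout(resolve, 5));
        return this.record;
    }

    async save(record) {
        await new Promise((resolve) => setTimeout(resolve, 5));
        this.saveCount += 1;
        this.record = record;
    }
}

// Hypothetical service illustrating the "single-flight" idea: the first
// caller starts the load-or-create work, later callers await the same promise.
class RoutingService {
    constructor(repository) {
        this.repository = repository;
        this.routingRecordPromise = null;
    }

    getRoutingRecord() {
        if (!this.routingRecordPromise) {
            // Assigned synchronously, before any await, so no second caller
            // can slip in between the check and the assignment.
            this.routingRecordPromise = this.loadOrCreate().catch((error) => {
                // Allow a retry on the next call if loading failed.
                this.routingRecordPromise = null;
                throw error;
            });
        }
        return this.routingRecordPromise;
    }

    async loadOrCreate() {
        let record = await this.repository.findById('MEDIATOR_ROUTING_RECORD');
        if (!record) {
            // Placeholder key; the real code would create a DID in the wallet.
            record = { id: 'MEDIATOR_ROUTING_RECORD', routingKeys: ['placeholder-verkey'] };
            await this.repository.save(record);
        }
        return record;
    }
}

// Simulates many clients requesting the routing record at the same time.
async function simulateConcurrentClients(clientCount) {
    const repository = new InMemoryRoutingRepository();
    const service = new RoutingService(repository);
    const records = await Promise.all(
        Array.from({ length: clientCount }, () => service.getRoutingRecord())
    );
    return {
        saveCount: repository.saveCount,
        allSame: records.every((record) => record === records[0]),
    };
}
```

Calling `simulateConcurrentClients(30)` should result in a single save regardless of query latency. Note that this only coordinates callers inside one Node.js process; several mediator processes sharing one database would still need a database-level guard such as a unique constraint.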

niall-shaw avatar Jun 22 '22 11:06 niall-shaw

Thanks for opening this issue @niallshaw-absa! I think there's still lots to improve in running AFJ server side.

I'll think about some things we can do to improve this. Do you have any suggestions on how we can best solve this?

TimoGlastra avatar Jun 22 '22 12:06 TimoGlastra

> Do you have any suggestions on how we can best solve this?

@TimoGlastra - nothing concrete, potentially a queue system for the queries, but that's just me spitballing
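To make the queue idea concrete, here is a minimal sketch of a serial queue built from a promise chain, so that check-then-create operations run one at a time. It is an illustration under assumptions, not the framework's implementation: `SerialQueue`, `enqueue`, and `runCheckThenCreate` are hypothetical names, and a `Map` with a 1 ms timer stands in for the database.

```javascript
// Hypothetical serial queue: each task starts only after the previous one
// has settled, which removes the overlap between "check" and "create".
class SerialQueue {
    constructor() {
        this.tail = Promise.resolve();
    }

    enqueue(task) {
        const result = this.tail.then(() => task());
        // Swallow rejections on the chain itself so one failed task does not
        // block later ones; the caller still sees the rejection via `result`.
        this.tail = result.catch(() => undefined);
        return result;
    }
}

// Runs `callerCount` concurrent check-then-create calls, with or without the
// queue, and returns how many times the singleton was created.
async function runCheckThenCreate(callerCount, useQueue) {
    const queue = new SerialQueue();
    const store = new Map();
    let createCount = 0;

    const checkThenCreate = async () => {
        const existing = store.get('singleton');
        // Simulated query latency between the check and the create.
        await new Promise((resolve) => setTimeout(resolve, 1));
        if (existing) {
            return existing;
        }
        createCount += 1;
        const created = { id: 'singleton' };
        store.set('singleton', created);
        return created;
    };

    await Promise.all(
        Array.from({ length: callerCount }, () =>
            useQueue ? queue.enqueue(checkThenCreate) : checkThenCreate()
        )
    );
    return createCount;
}
```

In this simulation, `runCheckThenCreate(30, false)` creates the record once per caller because every check runs before any create completes, while `runCheckThenCreate(30, true)` creates it once. The trade-off is that a queue serializes all queued work, so it is best limited to the operations that actually race.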

niall-shaw avatar Jun 22 '22 12:06 niall-shaw

Fixed in https://github.com/hyperledger/aries-framework-javascript/pull/985

niall-shaw avatar Aug 19 '22 08:08 niall-shaw