Mediation issues with multiple agents attempting to connect concurrently
We are encountering multiple issues (with no single identifiable cause) when up to 30 client agents attempt to initiate mediation with a single mediator agent simultaneously. The mediator agent is running MySQL via the vdr-tools binary (a fork of indy-sdk).
Some of the errors that have been occurring include the following:
- SIGSEGV (Segmentation fault)
- Heap corruption
- DEBUG: Request was aborted due to timeout. Not throwing error due to return routing on sent message
We believe one cause of this issue is that MySQL allows queries to execute with considerably more concurrency than SQLite. When checks like the one below run in parallel, each call finds no record and attempts to create the singleton record, so the record is created numerous times, resulting in failures.
if (!this._mediatorRoutingRecord) {
    this.agentConfig.logger.debug('Mediator routing record not loaded yet, retrieving from storage');
    let routingRecord = await this.mediatorRoutingRepository.findById(this.mediatorRoutingRepository.MEDIATOR_ROUTING_RECORD_ID);
    // If we don't have a routing record yet, create it
    if (!routingRecord) {
        this.agentConfig.logger.debug('Mediator routing record does not exist yet, creating routing keys and record');
        const { verkey } = await this.wallet.createDid();
        routingRecord = new repository_1.MediatorRoutingRecord({
            id: this.mediatorRoutingRepository.MEDIATOR_ROUTING_RECORD_ID,
            routingKeys: [verkey],
        });
        await this.mediatorRoutingRepository.save(routingRecord);
    }
    this._mediatorRoutingRecord = routingRecord;
}
This is one instance that I managed to identify. As a temporary fix, I delay every call of the function except the first by 20 ms, which gives the first save query time to finish executing; see below.
const thisQuery = ++this._totalWaitingQueries;
this.agentConfig.logger.debug('Retrieving mediator routing keys');
// If the routing record is not loaded yet, retrieve it from storage
if (thisQuery !== 1) {
    await new Promise((resolve) => setTimeout(resolve, 20));
}
However, this temporary fix is not optimal, as we should not add an arbitrary delay to all further operations.
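One possible alternative to the fixed delay, as a minimal sketch only: cache the in-flight promise rather than the resolved record, so that every concurrent caller awaits the same find-or-create operation. `InMemoryRepository`, `RoutingService`, `getRoutingRecord`, and the placeholder verkey are hypothetical stand-ins for illustration, not the AFJ API, and this only guards against races within a single process.

```typescript
// Hypothetical record shape and in-memory repository; stand-ins, not the AFJ API.
interface RoutingRecord {
  id: string
  routingKeys: string[]
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

class InMemoryRepository {
  private records = new Map<string, RoutingRecord>()
  public saveCount = 0

  async findById(id: string): Promise<RoutingRecord | null> {
    await sleep(5) // simulate query latency
    return this.records.get(id) ?? null
  }

  async save(record: RoutingRecord): Promise<void> {
    await sleep(5) // simulate query latency
    if (this.records.has(record.id)) throw new Error(`Duplicate record: ${record.id}`)
    this.records.set(record.id, record)
    this.saveCount++
  }
}

class RoutingService {
  private static readonly RECORD_ID = 'MEDIATOR_ROUTING_RECORD'
  private repository: InMemoryRepository
  // Holds the in-flight (or settled) load so concurrent callers share one operation.
  private recordPromise?: Promise<RoutingRecord>

  constructor(repository: InMemoryRepository) {
    this.repository = repository
  }

  getRoutingRecord(): Promise<RoutingRecord> {
    // No await happens between the check and the assignment, so only one load can start.
    if (!this.recordPromise) {
      this.recordPromise = this.loadOrCreate().catch((error: unknown) => {
        this.recordPromise = undefined // allow a retry after a failure
        throw error
      })
    }
    return this.recordPromise
  }

  private async loadOrCreate(): Promise<RoutingRecord> {
    let record = await this.repository.findById(RoutingService.RECORD_ID)
    if (!record) {
      // Placeholder key; the real code would obtain a verkey from the wallet.
      record = { id: RoutingService.RECORD_ID, routingKeys: ['placeholder-verkey'] }
      await this.repository.save(record)
    }
    return record
  }
}
```

With this shape, 30 concurrent `getRoutingRecord()` calls trigger a single `save`, and no caller waits longer than the first load takes.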
Thanks for opening this issue @niallshaw-absa! I think there's still lots to improve in running AFJ server side.
I'll think about some things we can do to improve this. Do you have any suggestions on how we can best solve this?
> Do you have any suggestions on how we can best solve this?
@TimoGlastra - nothing concrete, potentially a queue system for the queries, but that's just me spitballing
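To make the queue idea a bit more concrete, here is a rough sketch of what it could look like: a promise chain that runs find-or-create operations one at a time. `SerialQueue`, `findOrCreate`, `getRoutingKeys`, and the in-memory `store` are all hypothetical names for illustration; nothing here is taken from AFJ or from the eventual fix.

```typescript
// Hypothetical helper: runs async tasks one at a time, in the order they were enqueued.
class SerialQueue {
  private tail: Promise<unknown> = Promise.resolve()

  enqueue<T>(task: () => Promise<T>): Promise<T> {
    // Start this task only after the previous one has settled.
    const result = this.tail.then(() => task())
    // Swallow the rejection on the chain so a failed task does not block later ones;
    // the caller still sees the rejection through `result`.
    this.tail = result.catch(() => undefined)
    return result
  }
}

// In-memory stand-in for the storage layer.
const store = new Map<string, string[]>()
let createCount = 0
const queue = new SerialQueue()
const delay = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

// Check-then-create: racy when run concurrently, safe when run through the queue.
async function findOrCreate(id: string): Promise<string[]> {
  await delay(5) // simulated lookup
  let keys = store.get(id)
  if (!keys) {
    await delay(5) // simulated insert
    keys = ['placeholder-verkey']
    store.set(id, keys)
    createCount++
  }
  return keys
}

function getRoutingKeys(): Promise<string[]> {
  return queue.enqueue(() => findOrCreate('MEDIATOR_ROUTING_RECORD'))
}
```

Calling `getRoutingKeys()` 30 times in parallel creates the record once, whereas calling `findOrCreate` directly would create it on every call. The trade-off is that all callers are serialized, and the queue only protects a single process; several mediator processes sharing one database would likely still need a uniqueness guarantee at the database level.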
Fixed in https://github.com/hyperledger/aries-framework-javascript/pull/985