Active-Active High Availability

Open nguyer opened this issue 1 year ago • 0 comments

Background

FireFly v1.1.0 introduced the ability to configure multiple namespaces and connect to multiple blockchains at the same time. This allowed FireFly to scale beyond the constraints of a single blockchain. Subsequent releases, including v1.2.0, continued to improve performance and scale. When deployed in a Kubernetes cluster, the current FireFly Core architecture allows for a recovery time of approximately 15 seconds. Looking forward to FireFly v1.3.0, Active-Active High Availability will introduce some meaningful architectural changes, enabling even greater scaling and availability.

Goals

  • FireFly Core should support fully Active-Active clusters
  • Automatic leader election will determine which runtime is responsible for critical threads where deterministic ordering is required
  • This should enable zero downtime failovers
  • This should enable increased throughput as requests can be handled by multiple runtimes
  • Failover or scaling events should be seamless to client applications connecting through a load balancer, with the exception that they may need to re-establish a long-lived WebSocket client connection in some cases

Work Items

The following is a tracking list of the smaller pieces of work needed for Active-Active High Availability in FireFly Core. The list is a work in progress, and more items will be added as they are discovered:

  • [x] https://github.com/hyperledger/firefly/issues/1381
  • [x] https://github.com/hyperledger/firefly/issues/1382
  • [ ] https://github.com/hyperledger/firefly/issues/1383
  • [ ] https://github.com/hyperledger/firefly/issues/1384
  • [ ] FireFly CLI automation that can create multiple replicas for testing leader election
  • [ ] E2E tests that test using the same namespace on multiple replicas
  • [ ] E2E tests that create multiple replicas, shut down the leader, test failover, etc.

nguyer avatar May 24 '23 16:05 nguyer