# Active-Active High Availability

## Background
FireFly v1.1.0 introduced the ability to configure multiple namespaces and connect to multiple blockchains at the same time, allowing FireFly to scale beyond the constraints of a single blockchain. Subsequent releases, including v1.2.0, continued to improve performance and scale. When deployed in a Kubernetes cluster, the current FireFly Core architecture allows for a recovery time of approximately 15 seconds. Looking forward to FireFly v1.3.0, Active-Active High Availability will introduce some meaningful architectural changes, enabling even greater scale and availability.
## Goals
- FireFly Core should support fully Active-Active clusters
- Automatic leader election will determine which runtime is responsible for critical threads where deterministic ordering is required (see the illustrative sketch after this list)
- This should enable zero downtime failovers
- This should enable increased throughput as requests can be handled by multiple runtimes
- Failover or scaling events should be seamless to client applications connecting through a load balancer, except that in some cases they may need to re-establish a long-lived WebSocket connection (see the reconnection sketch below)
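
To make the leader election goal more concrete, here is a minimal sketch of a lease-based election loop. This is illustrative only, not FireFly's actual implementation: the `leaseStore`, `tryAcquire`, and `runElection` names are hypothetical, and a real deployment would back the lease with shared state such as a database row or a Kubernetes Lease rather than in-process memory.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// leaseStore is a hypothetical stand-in for whatever shared state the
// replicas agree on (a database row, a Kubernetes Lease, etc.).
type leaseStore struct {
	mu      sync.Mutex
	holder  string
	expires time.Time
}

// tryAcquire claims or renews the lease for owner, returning true if owner
// is now the leader. An expired lease can be taken over by any replica.
func (s *leaseStore) tryAcquire(owner string, ttl time.Duration) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	now := time.Now()
	if s.holder == "" || s.holder == owner || now.After(s.expires) {
		s.holder = owner
		s.expires = now.Add(ttl)
		return true
	}
	return false
}

// runElection is a simplified leader-election loop: each replica periodically
// tries to acquire or renew the lease, and only the current leader runs the
// critical threads that require deterministic ordering.
func runElection(store *leaseStore, id string, ttl time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(ttl / 3)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			if store.tryAcquire(id, ttl) {
				fmt.Printf("%s: leading, processing ordered work\n", id)
			} else {
				fmt.Printf("%s: standby\n", id)
			}
		}
	}
}

func main() {
	store := &leaseStore{}
	stop := make(chan struct{})
	for _, id := range []string{"replica-0", "replica-1", "replica-2"} {
		go runElection(store, id, 3*time.Second, stop)
	}
	time.Sleep(10 * time.Second)
	close(stop)
}
```

If the leader stops renewing (crashes, is drained during a rolling upgrade, or loses connectivity), the lease expires after the TTL and one of the standby replicas takes over, which is what allows failover without operator intervention.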
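On the client side, the main visible effect of a failover is a dropped long-lived WebSocket connection. The sketch below shows one way a client could reconnect with backoff; the `github.com/gorilla/websocket` library, the `ws://localhost:5000/ws` URL, and the subscription mechanics are assumptions for illustration rather than FireFly's prescribed client pattern.

```go
package main

import (
	"log"
	"time"

	"github.com/gorilla/websocket"
)

// listen connects to a WebSocket endpoint and reconnects with exponential
// backoff whenever the connection drops, e.g. during a failover behind a
// load balancer.
func listen(url string) {
	backoff := time.Second
	for {
		conn, _, err := websocket.DefaultDialer.Dial(url, nil)
		if err != nil {
			log.Printf("connect failed: %v (retrying in %s)", err, backoff)
			time.Sleep(backoff)
			if backoff < 30*time.Second {
				backoff *= 2
			}
			continue
		}
		backoff = time.Second
		for {
			_, msg, err := conn.ReadMessage()
			if err != nil {
				log.Printf("connection dropped: %v (reconnecting)", err)
				conn.Close()
				break
			}
			log.Printf("event: %s", msg)
		}
	}
}

func main() {
	// Hypothetical endpoint; consult the FireFly docs for the real
	// WebSocket API and subscription options.
	listen("ws://localhost:5000/ws")
}
```
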
## Work Items
The following is a tracking list of the smaller pieces of work needed for Active-Active High Availability of FireFly Core. The list is a work in progress, and more items will be added as they are discovered:
- [x] https://github.com/hyperledger/firefly/issues/1381
- [x] https://github.com/hyperledger/firefly/issues/1382
- [ ] https://github.com/hyperledger/firefly/issues/1383
- [ ] https://github.com/hyperledger/firefly/issues/1384
- [ ] FireFly CLI automation that can create multiple replicas for testing leader election
- [ ] E2E tests that test using the same namespace on multiple replicas
- [ ] E2E tests that create multiple replicas, shut down the leader, test failover, etc.
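
To illustrate the last two items, the sketch below outlines what a leader-failover E2E test could look like. It is a rough sketch under several assumptions: the `/api/v1/status` path and `leader` field are hypothetical placeholders for however a replica reports leadership, and `stopReplica` is a stub for whatever the test environment uses to stop a replica (FireFly CLI, docker, kubectl).

```go
package e2e

import (
	"encoding/json"
	"net/http"
	"testing"
	"time"
)

// replicaStatus mirrors a hypothetical status payload; the real field names
// would come from FireFly's status API.
type replicaStatus struct {
	Leader bool `json:"leader"`
}

// waitForSingleLeader polls each replica until exactly one reports
// leadership, or fails the test when the deadline passes.
func waitForSingleLeader(t *testing.T, replicas []string, timeout time.Duration) string {
	t.Helper()
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		var leaders []string
		for _, url := range replicas {
			resp, err := http.Get(url + "/api/v1/status") // hypothetical endpoint
			if err != nil {
				continue // replica may be down or restarting
			}
			var s replicaStatus
			_ = json.NewDecoder(resp.Body).Decode(&s)
			resp.Body.Close()
			if s.Leader {
				leaders = append(leaders, url)
			}
		}
		if len(leaders) == 1 {
			return leaders[0]
		}
		time.Sleep(time.Second)
	}
	t.Fatal("expected exactly one leader before timeout")
	return ""
}

// stopReplica is a placeholder for however the test environment stops a
// replica; the mechanism is environment-specific and out of scope here.
func stopReplica(t *testing.T, url string) {
	t.Helper()
	t.Logf("stopping replica %s (environment-specific)", url)
}

func TestLeaderFailover(t *testing.T) {
	replicas := []string{"http://127.0.0.1:5000", "http://127.0.0.1:5001", "http://127.0.0.1:5002"}

	// Wait for the cluster to settle on a single leader, then stop it and
	// expect one of the remaining replicas to take over.
	leader := waitForSingleLeader(t, replicas, 30*time.Second)
	stopReplica(t, leader)

	var remaining []string
	for _, r := range replicas {
		if r != leader {
			remaining = append(remaining, r)
		}
	}
	waitForSingleLeader(t, remaining, 30*time.Second)
}
```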