WIP: Cuebot Scalability Change Proposal

Open DiegoTavares opened this issue 4 months ago • 0 comments

Incrementally redesign Cuebot's monolith into multiple services

Motivation

  1. Cuebot's current design doesn't scale well horizontally. Although multiple instances of the service can be load balanced to spread rqd requests, all instances still rely on a single SQL database that can only scale vertically.
  2. The current design relies heavily on the performance of the DispatchQuery, a costly query that gets slower as the frames table grows.
  3. We received feedback from multiple studios interested in the project that were wary of adding a Java-based application to their stack, as Java is not commonly used in the VFX/Animation industry.

Current Design challenges

  • rqd instances connect directly to Cuebot using gRPC, and each connection stays bound until one of the two sides restarts, which makes redistributing load without an outage a challenge.
  • The scheduling logic is implemented as a step in the logic that handles rqd reports. This design not only makes the process hard to maintain, but also creates a coupling that impacts performance: any step in the report handling that takes longer than anticipated slows the rate at which frames are booked.
  • Work is wasted when multiple nodes attempt to book the same layer. Without a global lock mechanism, conflicts are only resolved at the final step of the booking process, which is what prevents a frame from running on multiple hosts. Every node that loses the race has already done the rest of the booking work by then.
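
To make the last point concrete, here is a minimal Python sketch, not Cuebot code, that compares resolving a conflict at the final booking step with claiming the frame before any costly work starts. All names (`costly_dispatch_work`, `book_late_check`, `book_early_claim`) are hypothetical. The in-memory dict is only a stand-in for shared state, and the "early claim" variant is an assumption about what a global lock mechanism could look like, since this proposal does not yet specify one.

```python
def costly_dispatch_work(node, frame_id):
    # Hypothetical placeholder for the expensive part of booking
    # (dispatch query, resource matching, and so on).
    return (node, frame_id)


def book_late_check(nodes, frame_id):
    """Conflict resolved only at the final step.

    Every node selects the same frame and does the costly work before
    any of them commits; the final check lets exactly one through.
    Returns (winning_node, wasted_attempts).
    """
    booked = {}
    wasted = 0
    # Simulates the nodes racing: all costly work happens before any commit.
    candidates = [costly_dispatch_work(n, frame_id) for n in nodes]
    for node, fid in candidates:
        if fid in booked:
            wasted += 1  # the costly work above was thrown away
        else:
            booked[fid] = node
    return booked[frame_id], wasted


def book_early_claim(nodes, frame_id):
    """Frame claimed up front (assumed global lock), losers skip the work.

    dict.setdefault stands in for an atomic "claim if unclaimed"
    operation on some shared lock service.
    Returns (winning_node, wasted_attempts).
    """
    claims = {}
    wasted = 0
    for node in nodes:
        if claims.setdefault(frame_id, node) != node:
            continue  # someone else holds the claim; no costly work done
        costly_dispatch_work(node, frame_id)
    return claims[frame_id], wasted


if __name__ == "__main__":
    hosts = ["node-a", "node-b", "node-c"]
    print(book_late_check(hosts, "frame-0001"))
    print(book_early_claim(hosts, "frame-0001"))
```

In both variants the frame ends up on exactly one host. The difference is how much work the losing nodes do before finding out, which is the inefficiency described above.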

Constraints

Proposal

DiegoTavares, Sep 25 '24 20:09