OpenCue
WIP: Cuebot Scalability Change Proposal
Incrementally redesign Cuebot's monolith into multiple services
Motivation
- Cuebot's current design doesn't scale well horizontally. Although multiple instances of the service can be load balanced to spread requests from rqd hosts, every instance still relies on a single SQL database, which can only scale vertically.
- The current design relies heavily on the performance of the DispatchQuery, a costly query that slows down as the frames table grows.
- We received feedback from several studios interested in the project that were wary of adding a Java-based application to their stack, as Java is not commonly used in the VFX/animation industry.
Current Design challenges
- rqd instances connect directly to Cuebot over gRPC, and each connection stays bound until one side restarts, which makes it hard to redistribute load without an outage.
- The scheduling logic is implemented as a step in the logic that handles rqd reports. This design makes the process hard to maintain and also creates a coupling that hurts performance: any step of report handling that takes longer than anticipated slows the rate at which frames are booked.
- Performance suffers when multiple nodes attempt to book the same layer. Without a global lock mechanism, conflicts are only resolved at the final step of the booking process. That final check is what prevents a frame from running on multiple hosts, but the work done by every node that loses the race is discarded.
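
To make the last point concrete, here is a minimal sketch of a booking conflict that is only detected at the final step. It is an illustration under stated assumptions, not Cuebot's actual code: the `Frame` class, the `try_commit` and `simulate` functions, and the host names are all hypothetical, and the in-memory state check stands in for a conditional database update.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Frame:
    """Hypothetical stand-in for a row in the frames table."""
    name: str
    state: str = "WAITING"
    host: Optional[str] = None


def try_commit(frame: Frame, host: str) -> bool:
    """Final booking step: succeeds only if the frame is still WAITING.

    Assumed to model a conditional update such as
    UPDATE ... SET state = 'RUNNING' WHERE state = 'WAITING'.
    """
    if frame.state != "WAITING":
        return False
    frame.state = "RUNNING"
    frame.host = host
    return True


def simulate(hosts: List[str], frame: Frame) -> Tuple[Optional[str], int]:
    """Return the winning host and the number of wasted booking attempts."""
    # Phase 1: with no global lock, every node selects the same candidate
    # frame and does its dispatch work, unaware of the other nodes.
    candidates = [(host, frame) for host in hosts]

    # Phase 2: conflicts surface only here, at the final commit.
    wasted = 0
    for host, candidate in candidates:
        if not try_commit(candidate, host):
            wasted += 1  # all of this node's earlier work is discarded
    return frame.host, wasted


if __name__ == "__main__":
    winner, wasted = simulate(["host-a", "host-b", "host-c"], Frame("layer1-0001"))
    print(f"booked on {winner}, wasted attempts: {wasted}")
```

In this sketch the frame runs on exactly one host, so correctness is preserved, but every other node has already paid for its dispatch work before learning it lost. The wasted work grows with the number of nodes competing for the same layer.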