open-match
Multi-tenancy and Partitioning
Is your feature request related to a problem? Please describe. It would be great if Open Match supported multiple tenants in a single instance. Hosting a single matchmaker that can support multiple fully isolated games (1) reduces operational cost and overhead, (2) consolidates infrastructure and configuration across games, and (3) reduces the overhead for hosting users evaluating multiple game design scenarios.
Adding to that, tenancy will likely require adding partitioning to open match. Partitioning in services is a common way to achieve (a) data-level isolation (read: secure) of unrelated application traffic and (b) the possibility for additional data-level horizontal scaling capabilities if vertical performance becomes an issue on a single HA database.
For open match, these would help solve some common issues:
- Improve query performance - Multiple projects sharing an Open Match instance must add extra query filters to separate their pools during query execution. This is continuous, costly CPU overhead.
- Pass-through tenancy - A user operating a matchmaker for others can easily build a multi-tenant offering to their users.
- Environments - A common need: running a matchmaker that can operate on multiple non-overlapping pools of players (QA/test, dev, internal beta, production)
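To make the query-overhead point concrete, here is a minimal self-contained sketch of how tenants sharing an instance separate their pools today: every pool repeats a tenant filter, and every query pays for it. The `Ticket` and `Pool` types below are hypothetical stand-ins (loosely mirroring Open Match's string-argument search fields), not the real API.

```go
package main

import "fmt"

// Hypothetical stand-in for a ticket carrying string search fields.
type Ticket struct {
	ID         string
	StringArgs map[string]string
}

// Hypothetical pool: today, every pool must repeat the tenant filter
// alongside its real matchmaking filters.
type Pool struct {
	StringEquals map[string]string
}

// filterTickets applies a pool's string-equality filters to the ticket set.
// The tenant filter runs on every ticket, for every pool, on every query.
func filterTickets(pool Pool, tickets []Ticket) []Ticket {
	var out []Ticket
	for _, t := range tickets {
		match := true
		for k, v := range pool.StringEquals {
			if t.StringArgs[k] != v {
				match = false
				break
			}
		}
		if match {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	tickets := []Ticket{
		{ID: "1", StringArgs: map[string]string{"tenant": "gameA", "mode": "ranked"}},
		{ID: "2", StringArgs: map[string]string{"tenant": "gameB", "mode": "ranked"}},
	}
	// The "tenant" entry is pure isolation overhead, repeated per pool.
	ranked := Pool{StringEquals: map[string]string{"tenant": "gameA", "mode": "ranked"}}
	fmt.Println(len(filterTickets(ranked, tickets))) // 1
}
```

First-class tenancy would let the system scope the ticket set before pool filters ever run, removing this per-query cost.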
Describe the solution you'd like I see a few options and tiers of changes available to achieve different levels of tenancy and partitioning.
Ticket isolation
- Tickets are stored with tenant information (a hash id or prefix-key).
- An identity token needs to be passed to frontdoors so the frontdoor can attach the tenant details.
- ALTERNATIVELY, setting the tenant details could be the responsibility of the calling service
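The bullets above can be sketched as a hypothetical frontdoor helper: derive the tenant from an identity token, then namespace the ticket under a prefix key. All names here are illustrative (the token format and helpers are assumptions, not Open Match API); a real frontdoor would validate a signed token rather than parse a string.

```go
package main

import (
	"fmt"
	"strings"
)

// tenantFromToken derives the tenant from an identity token. The
// "tenant:secret" format is a placeholder for illustration; real
// deployments would verify a signed token instead.
func tenantFromToken(token string) (string, error) {
	parts := strings.SplitN(token, ":", 2)
	if len(parts) != 2 || parts[0] == "" {
		return "", fmt.Errorf("invalid identity token")
	}
	return parts[0], nil
}

// prefixTicketID namespaces a ticket ID under its tenant, so storage-level
// scans, queries, and deletes can operate per tenant.
func prefixTicketID(tenant, ticketID string) string {
	return tenant + "/" + ticketID
}

func main() {
	tenant, err := tenantFromToken("gameA:secret")
	if err != nil {
		panic(err)
	}
	fmt.Println(prefixTicketID(tenant, "ticket-123")) // gameA/ticket-123
}
```

In the alternative model, the calling service would set the tenant field directly and the frontdoor would only validate it.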
Query-level isolation
- When functions query for tickets, they only receive tickets associated with their tenant
- Requires some kind of identity token to be passed by the match function to the query service for tenant verification and query-level tenancy. Idea: the token could be provided by the backend service during function activation.
- Requires tenancy at the ticket level and query service level.
- One challenge here will be in caching multiple large tenant pools at the query-service level.
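A toy sketch of the query-level contract: the query service resolves the caller's token to a tenant and only ever consults that tenant's cached pool. The in-memory maps below stand in for the backend-issued activation tokens and the per-tenant ticket caches described above; nothing here is real Open Match API.

```go
package main

import "fmt"

// QueryService is a hypothetical tenant-scoped query service.
type QueryService struct {
	tokens  map[string]string   // token -> tenant (issued at function activation)
	tickets map[string][]string // tenant -> cached ticket IDs
}

// Query verifies the token and returns only the caller's tenant pool;
// other tenants' tickets are never visible to the caller.
func (q *QueryService) Query(token string) ([]string, error) {
	tenant, ok := q.tokens[token]
	if !ok {
		return nil, fmt.Errorf("unknown identity token")
	}
	return q.tickets[tenant], nil
}

func main() {
	q := &QueryService{
		tokens:  map[string]string{"tok-A": "gameA"},
		tickets: map[string][]string{"gameA": {"t1"}, "gameB": {"t2"}},
	}
	ids, _ := q.Query("tok-A")
	fmt.Println(ids) // [t1]
}
```

The caching challenge mentioned above lives inside `tickets`: each tenant's pool must be kept warm independently, which multiplies the query service's memory footprint.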
Logical partitioning
- Tickets are hashed by tenant into a set of multiple HA DBs
- Enables high-scale performance for a large tenant-count environment
- Enables breaking out noisy-neighbor tenants into their own db-level (possibly query-service level)
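A minimal sketch of that routing decision, assuming a deterministic hash over the tenant ID plus an explicit pin table for noisy neighbors (FNV-1a is used here for simplicity; a real deployment would likely want consistent hashing so resharding moves fewer tenants):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor picks the database instance for a tenant. Pinned tenants get a
// dedicated shard (the noisy-neighbor break-out); everyone else is hashed
// deterministically across the shared pool.
func shardFor(tenant string, shards []string, pinned map[string]string) string {
	if s, ok := pinned[tenant]; ok {
		return s
	}
	h := fnv.New32a()
	h.Write([]byte(tenant))
	return shards[int(h.Sum32())%len(shards)]
}

func main() {
	shards := []string{"redis-0", "redis-1", "redis-2"}
	// Same tenant always lands on the same shard.
	fmt.Println(shardFor("gameA", shards, nil) == shardFor("gameA", shards, nil)) // true
	// A noisy tenant is pinned to its own instance.
	fmt.Println(shardFor("noisy", shards, map[string]string{"noisy": "redis-dedicated"})) // redis-dedicated
}
```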
Application isolation
- Because tenants consist of tickets from non-intersecting sets, the synchronizer context + evaluator no longer need to operate over the entire match proposal pool. Introducing multiple synchronizer-contexts can enable performance isolation between tenant/partition-specific pools.
- Quick explanation of why this enables performance isolation: the synchronizer waits for the slowest function to return proposals (up to a time limit), then stops waiting. This is done to allow functions to operate on a single pool of tickets. With multiple logical ticket pools, one tenant may have fast functions while another is experiencing an outage; the former would be stuck waiting on the latter.
Describe alternatives you've considered
There's another proposal I'm putting together to enable a first-class concept for pools. As part of that exploration, Pools might enable automatic querying of tickets to be passed into functions, instead of having functions query the tickets themselves. For some providers, this would remove the need for functions to pass a tenant-identity token to the query service. This is technically more restrictive, in that a provider may choose to disable function-query capabilities in favor of Pool-based generation. It has been discussed before, but may be worth exploring again.
Additional context There is likely an upper limit to what a single Open Match instance can handle. I think it's worth discussing what our feasible goals are for tenant volume and maximum throughput (even for a single tenant, let alone multiple large tenants or a thousand small ones).
Path to tenant support
There are three layers of multi-tenancy we identified:
- Logical separation
- Performance isolation
- Secure isolation
Logical separation would contain the largest portion of API changes. We could basically keep the internals the same, with the exception of adding a query filter and splitting evaluator calls. Tickets and fetch-match calls would require a tenant field. If the empty string is a valid tenant ID, this fits nicely as a backward-compatible change.
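A small sketch of both pieces of that paragraph: a new (hypothetical) tenant field where the empty string is the implicit default tenant, and a helper that splits proposals by tenant so the evaluator can be invoked once per tenant. Existing single-tenant callers never set the field and all land in the `""` group, which is the backward-compatibility claim above.

```go
package main

import "fmt"

// Ticket with a new tenant field; "" is the default tenant, so existing
// single-tenant callers keep working unchanged. Hypothetical shape, not
// the real Open Match message.
type Ticket struct {
	ID     string
	Tenant string
}

// splitByTenant groups tickets so evaluator calls can be made per tenant.
func splitByTenant(tickets []Ticket) map[string][]Ticket {
	out := make(map[string][]Ticket)
	for _, t := range tickets {
		out[t.Tenant] = append(out[t.Tenant], t)
	}
	return out
}

func main() {
	groups := splitByTenant([]Ticket{
		{ID: "1"},                   // legacy caller, default tenant
		{ID: "2", Tenant: "gameA"},
		{ID: "3", Tenant: "gameA"},
	})
	fmt.Println(len(groups[""]), len(groups["gameA"])) // 1 2
}
```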
Performance isolation can be taken as multiple step-by-step improvements. Items that would prevent overall performance degradation (eg, limiting the number of tickets) would also be useful to prevent one tenant from degrading the others. Obvious targets are:
- splitting query cache instances, so that query can operate independently for each tenant.
- having query instances for specific tenants - this either requires an intermediary to route requests, or a mechanism for a match function to be assigned to a specific query instance. It would complicate deployments where the match function is not running inside the same k8s cluster as OM.
- having multiple synchronizers, with some sort of locking mechanism for which synchronizer handles which tenant. This would also help availability.
- supporting specifying the redis database used for a given tenant, both the database number for the query, and the actual database. This could potentially also tie into improvements where tickets are sharded across multiple redis instances to further improve scale.
- Other quota limits on calls
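As a sketch of the last bullet, a per-tenant quota could cap live tickets so one tenant cannot degrade the instance for others. The counter below is an illustrative in-memory model (a real implementation would need to be shared across frontend replicas and decrement on ticket deletion).

```go
package main

import "fmt"

// Quota tracks live tickets per tenant against a shared cap.
type Quota struct {
	limit  int
	counts map[string]int
}

func NewQuota(limit int) *Quota {
	return &Quota{limit: limit, counts: map[string]int{}}
}

// Admit accepts a new ticket for the tenant, or rejects it once the
// tenant has hit its cap. Other tenants are unaffected.
func (q *Quota) Admit(tenant string) error {
	if q.counts[tenant] >= q.limit {
		return fmt.Errorf("tenant %q over ticket quota (%d)", tenant, q.limit)
	}
	q.counts[tenant]++
	return nil
}

func main() {
	q := NewQuota(2)
	fmt.Println(q.Admit("gameA") == nil) // true: first ticket admitted
	fmt.Println(q.Admit("gameA") == nil) // true: second ticket admitted
	fmt.Println(q.Admit("gameA") != nil) // true: third ticket rejected
	fmt.Println(q.Admit("gameB") == nil) // true: other tenants unaffected
}
```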
Secure isolation means that bad-actor tenants can't access other tenants' data. This would require incorporating authentication into all of the OM calls, with a tenant ID and auth tokens. Something like OAuth would work here. We also need to make sure that one tenant isn't able to call into others' match functions that OM has access to through the backend.
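The access check itself reduces to a small guard that every OM call would pass through: resolve the token to a tenant, then deny any request targeting a different tenant. The static token map below is a placeholder for illustration; a real deployment would verify signed tokens (eg, OAuth bearer tokens) instead.

```go
package main

import "fmt"

// authorize rejects unauthenticated calls and cross-tenant access.
// tokenTenants stands in for real token verification.
func authorize(tokenTenants map[string]string, token, requestedTenant string) error {
	tenant, ok := tokenTenants[token]
	if !ok {
		return fmt.Errorf("unauthenticated")
	}
	if tenant != requestedTenant {
		return fmt.Errorf("tenant %q may not access %q", tenant, requestedTenant)
	}
	return nil
}

func main() {
	tokens := map[string]string{"tok-A": "gameA"}
	fmt.Println(authorize(tokens, "tok-A", "gameA") == nil) // true: own data
	fmt.Println(authorize(tokens, "tok-A", "gameB") != nil) // true: cross-tenant denied
}
```

The same guard would sit in front of backend-triggered match function calls, covering the "calling into others' match functions" case.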
Alternative
Many aspects of the requirements can actually be solved by running multiple smaller OM instances. If we made that easier, multi-tenant support would be needed less. The remaining significant costs would be extremely small tenants, and the operational cost of hosting many instances.
Other uses
When it comes to performance isolation, that may actually be useful within single games which have totally partitioned matchmaking modes. Eg, if players searching for Quick Match will never match with players searching for Ranked Play, then performance isolation would give the matchmaker graceful degradation: if only one mode were performing poorly, the other modes would continue to function. I don't think this is as significant as generally making it easier to avoid performance-degradation scenarios, but if we're basically there, it'd be nice to add.
Hello,
We are also interested in a multi-tenant solution that could share one Redis instance across multiple environments. In our case, we are using a managed Redis service, and a single game project has many environments (>30), which costs too much.
We also considered running Redis in GKE for these development environments and using managed Redis for production, but the project side feels that is risky, because development and production would then run different Redis setups.
I have a game with 1000 different mini games. How can I run matchmaking for these in parallel? Just asking whether the above proposal would help me, or whether an alternative already exists.