Proxy can divert routes to support A/B testing, service rollout etc

Open Tratcher opened this issue 4 years ago • 15 comments

A/B testing is a common scenario for proxies, letting services incrementally upgrade/experiment using live traffic.

How does a service decide which requests to route to which test group?

  • Assuming all requests are stateless you could decide this randomly on a per-request basis.
  • For a stateful service you'd want to do a random assignment and then use session affinity to keep each client in its group (https://github.com/microsoft/reverse-proxy/issues/45).
  • A service may also decide to statically affinitize based on known groups like tenants (e.g. move small customers before big ones).
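
A minimal sketch of the first two options, assuming two groups named "A" and "B" and a percentage of traffic destined for "B". The type and method names are hypothetical and not part of the proxy. The stable variant hashes a key such as a session id or tenant id, so the same key always lands in the same group without storing any state.

```csharp
using System;
using System.Buffers.Binary;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper for illustration; not a proxy API.
public static class TestGroupAssignment
{
    // Stateless services: flip a weighted coin on every request.
    public static string PickRandom(Random rng, int percentToB) =>
        rng.Next(100) < percentToB ? "B" : "A";

    // Stateful services or tenant-based rollouts: hash a stable key
    // (session id, tenant id, header value) so it always maps to the same group.
    public static string PickStable(string key, int percentToB) =>
        Bucket(key) < percentToB ? "B" : "A";

    // Maps a key to a bucket in [0, 100). SHA-256 is used here because
    // string.GetHashCode is randomized per process in .NET.
    public static int Bucket(string key)
    {
        byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes(key));
        return (int)(BinaryPrimitives.ReadUInt32LittleEndian(hash) % 100);
    }
}
```

Raising percentToB only moves keys from "A" to "B", never back, because each key's bucket is fixed.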

Initial theory:

  • This is a separate feature from load balancing. Within a load balancing set we assume all servers are equally able to handle a request aside from considerations of health and load. You'd select a test group and then load balance within that group.
  • This should take advantage of the routing layer. For something like tenant based affinity that information may already be part of the route definition.
  • Alternatively we could put it into the route config (ProxyRoute), allowing it to specify multiple backends and an assignment strategy.
  • This should be compatible with retry mechanisms to allow failing over to the other test group.

Tratcher, Apr 28 '20 23:04

@samsp-msft in your surveys have customers explained how they handle this today? And/or how they want to handle it?

Tratcher, Apr 28 '20 23:04

Triage: Design should be done in 1.0 - customers will ask for it.

karelz, May 14 '20 18:05

This could also come into play for rolling upgrades: https://github.com/microsoft/reverse-proxy/issues/190#issuecomment-632798853

  1. Remove some nodes from pool1
  2. Shut down those nodes
  3. Upgrade and restart them
  4. Add those nodes to pool2
  5. Gradually transition a percentage of traffic from pool1 to pool2 and check for errors. (A/B testing)

Repeat until all nodes have been upgraded and moved.
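
The loop above can be sketched as a small state machine. This is only an illustration under stated assumptions: the RollingUpgrade type, the callbacks for upgrading a node and for checking pool2's error rate, and the rule that a failed check restores the previous percentage are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical orchestration sketch; not a proxy API.
public sealed class RollingUpgrade
{
    public List<string> Pool1 { get; }
    public List<string> Pool2 { get; } = new();
    public int PercentToPool2 { get; private set; }

    public RollingUpgrade(IEnumerable<string> nodes) => Pool1 = nodes.ToList();

    // Steps 1-4: remove a batch from pool1, upgrade it, add it to pool2.
    public void MoveBatch(int count, Action<string> shutDownUpgradeRestart)
    {
        List<string> batch = Pool1.Take(count).ToList();
        Pool1.RemoveRange(0, batch.Count);
        foreach (string node in batch)
        {
            shutDownUpgradeRestart(node);
        }
        Pool2.AddRange(batch);
    }

    // Step 5: shift traffic toward pool2 and check for errors.
    public bool ShiftTraffic(int targetPercent, Func<bool> pool2LooksHealthy)
    {
        if (Pool2.Count == 0) return false; // nothing has been upgraded yet
        int previous = PercentToPool2;
        // Once pool1 is empty, pool2 has to carry all traffic.
        PercentToPool2 = Pool1.Count == 0 ? 100 : Math.Clamp(targetPercent, 0, 100);
        if (pool2LooksHealthy()) return true;
        if (Pool1.Count > 0) PercentToPool2 = previous; // roll back the shift
        return false;
    }
}
```

A caller would alternate MoveBatch and ShiftTraffic until Pool1 is empty, stopping as soon as ShiftTraffic returns false.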

Tratcher, May 22 '20 16:05

Note: Backend.PartitioningOptions seems to be related to this scenario, but it's not actually implemented anywhere. We should either implement or remove this options section. The same goes for the quota and circuit breaker options.

Tratcher, May 26 '20 21:05

Notes from a partner discussion:

Routing - One Catch-All
  - Partitioning function to select cluster
Clusters, each with one destination
  - Next
  - Current
  - Previous
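
One possible shape for such a partitioning function, as a sketch only: the ClusterPartitioner name, the per-cluster weights, and the choice of FNV-1a as the hash are assumptions made for illustration, not details from the partner discussion. Each cluster owns a slice of the hash space proportional to its weight.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; not a proxy API.
public static class ClusterPartitioner
{
    // Picks a cluster id for a key. The result is deterministic, so a given key
    // keeps its cluster for as long as the weights stay the same.
    public static string SelectCluster(
        string key, IReadOnlyList<(string ClusterId, int Weight)> clusters)
    {
        int total = 0;
        foreach (var (_, weight) in clusters) total += weight;
        if (total <= 0)
        {
            throw new ArgumentException("At least one cluster needs a positive weight.", nameof(clusters));
        }

        long point = Fnv1a(key) % (uint)total;
        foreach (var (clusterId, weight) in clusters)
        {
            if (point < weight) return clusterId;
            point -= weight;
        }
        throw new InvalidOperationException("Unreachable when all weights are non-negative.");
    }

    // 32-bit FNV-1a: simple and stable across processes, unlike string.GetHashCode.
    private static uint Fnv1a(string key)
    {
        uint hash = 2166136261;
        foreach (byte b in Encoding.UTF8.GetBytes(key))
        {
            hash = unchecked((hash ^ b) * 16777619);
        }
        return hash;
    }
}
```

With weights such as Next 5, Current 90, Previous 5, roughly that share of distinct keys would be expected in each cluster, and changing the weights reassigns some keys.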

Tratcher, Aug 27 '20 18:08

@johnazariah to your question from: https://github.com/microsoft/reverse-proxy/issues/405#issuecomment-689092153

You could still use the headers to decide where requests get assigned, but that decision would be applied at a stage after routing and would be much more customizable. Affinity is definitely an open question; the best option is a stable selection algorithm, such as one keyed on the headers.

Tratcher, Sep 08 '20 19:09

Ok I like what I'm hearing... 👍

johnazariah, Sep 08 '20 20:09

Related: partitioning and shuffle sharding, great talk from AWS: https://www.youtube.com/watch?v=swQbA4zub20

Tratcher, Nov 19 '20 18:11

E-Core3 is using IHttpProxy directly and does not need anything from us here. There is still general community interest so I'll do some prototyping and see what works.

Tratcher, Feb 23 '21 22:02

Triage: We should consider these design points from https://github.com/microsoft/reverse-proxy/issues/809#issuecomment-793233000

Design notes:

  • Clusters already create a logical grouping of destinations, so it would be convenient to use them as the unit of A/B testing. You may also want to change settings like load balancing and protocol versions between groups.
  • In config you'd keep the current cluster id field for simplicity, but add the ability to list alternate clusters by id.
  • These additional clusters would want some metadata associated with them as input into the selection algorithm.
  • ClusterSnapshot and AllDestinations on IReverseProxyFeature are not currently settable.

    proxyPipeline.Use((context, next) =>
    {
        var proxyFeature = context.Features.Get<IReverseProxyFeature>();
        var route = proxyFeature.RouteSnapshot;
        // IReadOnlyDictionary<ClusterInfo, IDictionary<string, string>> alternates = route.AlternateClusters;

        ClusterInfo cluster = PickCluster(context, route);

        // Update ProxyFeature
        proxyFeature.ClusterSnapshot = cluster.Config;
        var state = cluster.DynamicState;
        proxyFeature.AllDestinations = state.AllDestinations;
        proxyFeature.AvailableDestinations = state.HealthyDestinations;

        return next();
    });

This could be built as middleware, with PickCluster abstracted behind an interface as we do for load balancing.
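
A sketch of what such an interface might look like, kept free of proxy types so it stands alone. Everything here is hypothetical: the IClusterSelectionPolicy name, the use of plain dictionaries for request headers and per-cluster metadata, and the "Header" and "Value" metadata keys. The alternates parameter mirrors the commented-out AlternateClusters idea above, keyed by cluster id.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical extensibility point; not an existing proxy API.
public interface IClusterSelectionPolicy
{
    string Name { get; }

    // alternates maps an alternate cluster id to its metadata from config.
    string PickCluster(
        IReadOnlyDictionary<string, string> requestHeaders,
        string defaultClusterId,
        IReadOnlyDictionary<string, IReadOnlyDictionary<string, string>> alternates);
}

// Picks an alternate whose "Header"/"Value" metadata matches the request,
// and falls back to the route's default cluster otherwise.
public sealed class HeaderMatchPolicy : IClusterSelectionPolicy
{
    public string Name => "HeaderMatch";

    public string PickCluster(
        IReadOnlyDictionary<string, string> requestHeaders,
        string defaultClusterId,
        IReadOnlyDictionary<string, IReadOnlyDictionary<string, string>> alternates)
    {
        foreach (var (clusterId, metadata) in alternates)
        {
            if (metadata.TryGetValue("Header", out var header)
                && metadata.TryGetValue("Value", out var expected)
                && requestHeaders.TryGetValue(header, out var actual)
                && string.Equals(actual, expected, StringComparison.OrdinalIgnoreCase))
            {
                return clusterId;
            }
        }
        return defaultClusterId;
    }
}
```

Keeping the default cluster as the fallback means a route still works when no alternate matches.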

karelz, Mar 22 '21 19:03

Triage: Moving to post-1.0 -- we need engaged customers who can validate the design / E2E.

karelz, Mar 24 '21 19:03

Here's a partial proposal for 1.1 that should let people build their own A/B features without us having to be too opinionated on their design:

Routes would need a default cluster to get past this check: https://github.com/microsoft/reverse-proxy/blob/5b205083d25a63cd2c924f38c0baf952dc9e0e05/src/ReverseProxy/Model/ProxyPipelineInitializerMiddleware.cs#L34-L41

In middleware re-assign the cluster: https://github.com/microsoft/reverse-proxy/blob/5b205083d25a63cd2c924f38c0baf952dc9e0e05/src/ReverseProxy/Model/ProxyPipelineInitializerMiddleware.cs#L43-L50

We'd need a new service to get access to the clusters and routes.

  • Read-only lookup by id: TryGetCluster(id, out var cluster) and TryGetRoute(id, out var route)
  • Enumeration of all clusters and routes
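
A rough sketch of the lookup surface described by the two bullets above, backed by in-memory dictionaries. The IProxyLookup interface and the ClusterEntry and RouteEntry records are hypothetical stand-ins for whatever types the proxy would actually expose.

```csharp
#nullable enable
using System.Collections.Generic;

// Hypothetical stand-ins for the proxy's cluster and route models.
public sealed record ClusterEntry(string Id);
public sealed record RouteEntry(string Id, string ClusterId);

// Hypothetical read-only lookup service.
public interface IProxyLookup
{
    bool TryGetCluster(string id, out ClusterEntry? cluster);
    bool TryGetRoute(string id, out RouteEntry? route);
    IEnumerable<ClusterEntry> GetClusters();
    IEnumerable<RouteEntry> GetRoutes();
}

public sealed class InMemoryProxyLookup : IProxyLookup
{
    private readonly Dictionary<string, ClusterEntry> _clusters = new();
    private readonly Dictionary<string, RouteEntry> _routes = new();

    public InMemoryProxyLookup(IEnumerable<ClusterEntry> clusters, IEnumerable<RouteEntry> routes)
    {
        foreach (var cluster in clusters) _clusters[cluster.Id] = cluster;
        foreach (var route in routes) _routes[route.Id] = route;
    }

    // Lookup by id; returns false and a null out value when the id is unknown.
    public bool TryGetCluster(string id, out ClusterEntry? cluster) =>
        _clusters.TryGetValue(id, out cluster);

    public bool TryGetRoute(string id, out RouteEntry? route) =>
        _routes.TryGetValue(id, out route);

    // Enumeration of everything currently known.
    public IEnumerable<ClusterEntry> GetClusters() => _clusters.Values;

    public IEnumerable<RouteEntry> GetRoutes() => _routes.Values;
}
```

Middleware could resolve such a service from dependency injection, look up the target cluster by id, and then re-assign the request to it.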

Tratcher, Jan 28 '22 18:01

Can everyone here please take a look at https://github.com/microsoft/reverse-proxy/pull/1538? It doesn't fully implement A/B or related systems, but it does provide some of the raw components to unblock people building their own.

Tratcher, Feb 05 '22 01:02

For 2.0 let's consider how to address the most common subset of use cases with extensibility points.

adityamandaleeka, May 19 '22 17:05

Triage: The next step is to flesh out the design. Collect requirements, make proposals, discuss them.

karelz, Jun 16 '22 16:06