
Helm: Migration from Mimir to Enterprise - make this more seamless

ama-source opened this issue 2 years ago · 6 comments

Is your feature request related to a problem? Please describe.

I have Mimir deployed with ingress enabled through nginx, running on AWS EKS, and I have configured the nginx load balancer with subnets and SSL settings. When I upgrade to Enterprise, the nginx pod and service fail to start and I no longer have ingress to the cluster.

Describe the solution you'd like

GEM is enabled through a flag on the chart; that flag should take into account the settings I've already applied to the chart and carry them over seamlessly.

ama-source · Jun 23 '22 17:06

Thanks for filing this issue @ama-source. I know of a few issues right now:

  1. nginx is wholesale replaced with the GEM gateway, which means any ingress you have using the nginx service will stop working after the upgrade.
  2. nginx and the GEM gateway services use different ports, so in addition to changing the service name in use, clients have to set a different port (80 vs 8080).
  3. nginx and the GEM gateway have separate ingress configurations, leading to the issue described here.

One option for doing this (discussed in the internal Slack) is to have a central ingress configuration that applies to both nginx and the gateway. Then, when toggling the enterprise flag, we would internally switch the service the ingress points at.

This would fix it for anyone using an ingress, but not for anyone using the k8s services directly.
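
To make the idea concrete, here is a minimal sketch of what such a shared ingress template could look like. The `mimir.fullname` helper, the `ingress.*` values keys, and the service names are placeholders for illustration, not the chart's actual template; only the enterprise toggle and the 80/8080 ports come from the discussion above.

```yaml
# templates/ingress.yaml (sketch) - one Ingress whose backend service flips
# with the enterprise flag, so external traffic keeps flowing across the toggle.
{{- if .Values.ingress.enabled }}
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: {{ include "mimir.fullname" . }}
spec:
  rules:
    - host: {{ .Values.ingress.host }}
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                {{- if .Values.enterprise.enabled }}
                name: {{ include "mimir.fullname" . }}-gateway
                port:
                  number: 8080
                {{- else }}
                name: {{ include "mimir.fullname" . }}-nginx
                port:
                  number: 80
                {{- end }}
{{- end }}
```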

Logiraptor · Jun 23 '22 17:06

I did a bit of testing

Findings

During my testing only the networking was an issue; the other components switched over from Mimir to GEM without problems.

TL;DR: having the chart create a separate resource of each kind (Service, Deployment, Ingress) for the gateway and for nginx is what's causing the issues.

  • while recreating the ingress causes some downtime, it's fairly short - less than 10s on my k3d cluster with a Traefik ingress controller

    • I couldn't find a way to spread the same traffic over two different k8s services. I don't think this is supported by the kubernetes ingress (SO answer)
    • this means we also need a unified service for nginx and the gateway to have zero-downtime migrations from mimir to GEM
  • what I found to be causing more downtime is the non-graceful replacement of the nginx Deployment with the GEM gateway Deployment: Helm effectively does a kubectl delete on the nginx Deployment, its pods are gone in a few seconds, and in the meantime the gateway takes around 30 seconds to start up

    • pod disruption budgets don't help here, nor do rollout strategies
    • so the solution would be to use the same name for both Deployments - in other words, use the same Deployment to run both the gateway and nginx. This will allow rolling updates from nginx to the GEM gateway (see the template sketch after this list)
  • and the third thing is the nuisance of having to copy and adapt the nginx config to the gateway config

    • quite a few options have different names (e.g. podSecurityContext vs securityContext, affinity as a string vs affinity as an object); some options exist for one, but not the other (the gateway cannot be disabled when enterprise == true)
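
To illustrate the Deployment point, here is a rough template sketch of a unified target; the helper and values key names are made up for illustration and not taken from the chart. Because the resource name and label selector never change, flipping the enterprise flag becomes an ordinary rolling update instead of a delete of the nginx Deployment followed by a slow gateway start-up.

```yaml
# templates/proxy/deployment.yaml (sketch) - the name and selector stay the
# same whether the pod runs nginx or the GEM gateway.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "mimir.fullname" . }}-proxy
spec:
  replicas: {{ .Values.proxy.replicas }}
  selector:
    matchLabels:
      app.kubernetes.io/component: proxy   # must stay fixed: selectors are immutable
  template:
    metadata:
      labels:
        app.kubernetes.io/component: proxy
    spec:
      containers:
        {{- if .Values.enterprise.enabled }}
        - name: gateway
          image: "{{ .Values.gateway.image.repository }}:{{ .Values.gateway.image.tag }}"
          ports:
            - containerPort: 8080
        {{- else }}
        - name: nginx
          image: "{{ .Values.nginx.image.repository }}:{{ .Values.nginx.image.tag }}"
          ports:
            - containerPort: 80
        {{- end }}
```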

Proposal

I think the way forward is to unify the two reverse proxies under one values file section proxy: (naming TBD ofc), one Service, one Ingress, and one Deployment, and to deprecate and eventually remove the nginx- and gateway-specific resources.

This is my plan for the migration to that unified proxy (a rough values sketch follows the list below):

  • a chart update (4.1.0) introduces a new deployable target - proxy - which is nginx when running OSS and the GEM gateway when running enterprise; most of the config is shared except where it doesn't make sense (e.g. nginx config; nginx basic auth config)
  • 4.1.0 deprecates both gateway and nginx
  • by default in 4.1.0 the chart won't deploy the proxy (replicas == 0)
  • migration path:
    • the user is responsible for increasing the replicas of proxy and decreasing the replicas of nginx/gateway
    • the user switches off the old ingress and turns on the new ingress in one helm upgrade - some downtime with data loss is expected while the controller reconciles the two ingresses (e.g. Traefik returns a non-retriable HTTP 404 during the switch, hence the data loss)
  • as per the deprecation policy (e.g. in 7.0.0) we release a breaking change that removes the nginx and gateway values and changes the default replicas of the proxy from 0 to 1
    • validation is added in that release so that folks using the gateway or nginx values get an error when they try to upgrade
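
For concreteness, here is a values sketch of what the 4.1.0 defaults for the new section could look like. The key names are hypothetical - the proposal explicitly leaves the naming TBD - and only the behaviour described above (0 replicas by default, shared config, nginx-only settings kept separate) is taken from the plan.

```yaml
# values.yaml sketch of the proposed unified section (hypothetical names)
proxy:
  replicas: 0          # 4.1.0 default: the proxy is not deployed yet
  service:
    port: 80
  ingress:
    enabled: false
  # nginx-specific settings (nginx config template, basic auth) would stay in
  # their own sub-section and be ignored when enterprise is enabled, where the
  # same target runs the GEM gateway instead.
```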

Alternatives

  1. Instead of introducing proxy, make the chart recognize which of nginx or the GEM gateway has been configured and use that configuration for the single proxy target. This way we don't introduce breaking changes in the config, but the downtime is still there because of the two ingresses.

  2. Be able to override the names of the services, ingresses, and deployments. When the user switches from Mimir to GEM, they override the names of the gateway resources to match the nginx ones, so Helm does an in-place update instead of a delete and create.

    • this has the advantage of zero-downtime migration between mimir and GEM
    • has the caveat that the selectors in the Deployments aren't mutable and they differ between nginx and the gateway, so this needs a manual non-cascading delete of the Deployment followed by a helm upgrade
    • so the actual migration would be: change values.yaml, kubectl delete deployment --cascade=false, then helm upgrade (see the sketch below)
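
A sketch of what that could look like in practice, with the caveat that none of these override keys exist in the chart today - they would have to be added for this alternative to work - and the resource and release names below are placeholders.

```yaml
# values.yaml sketch for alternative 2 (hypothetical keys): make the gateway
# resources reuse the existing nginx names so Helm updates them in place.
gateway:
  nameOverride: mimir-nginx
  service:
    nameOverride: mimir-nginx
  ingress:
    nameOverride: mimir-nginx
# Because Deployment selectors are immutable, the old Deployment still needs a
# manual non-cascading delete before the upgrade, roughly:
#   kubectl delete deployment mimir-nginx --cascade=false
#   helm upgrade <release> <chart> -f values.yaml
```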

I think going with the breaking change will make the chart more pleasant to use in the long run: more consistent, and no cognitive overhead for users from having to reason about two things. It will also keep the implementation of the chart simple, which the first alternative doesn't.

@krajorama @Logiraptor @ama-source can you take a look at what I have in this comment? If you are all ok with this I will start implementing what's outlined in "Proposal"

dimitarvdimitrov · Oct 04 '22 11:10

I agree we should go with the simpler, more consistent approach suggested by @dimitarvdimitrov. I'd go even further and say that the migration should be like this:

  1. Start new proxy (all replicas)
  2. Some may be pending due to resource constraints, but let's ignore that
  3. Switch over the ingress and also disable nginx/gateway in one go. (New proxies will be able to start automatically as resources are available).

I imagine that switching over the ingress is quite fast, so the disruption would be minimal.
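
A rough two-step values sketch of that flow, using the same hypothetical key names as the earlier proxy sketch; each YAML document below stands for a separate values override applied in its own helm upgrade, and nginx is shown for the OSS case (the gateway would be disabled analogously for enterprise).

```yaml
# Step 1: bring up the new proxy alongside the old target.
proxy:
  replicas: 3
---
# Step 2 (one helm upgrade): switch the ingress over and disable nginx/gateway
# in one go; any pending proxy pods start as resources free up.
proxy:
  replicas: 3
  ingress:
    enabled: true
nginx:
  replicas: 0
  ingress:
    enabled: false
```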

krajorama · Oct 04 '22 12:10

What if we re-use the existing gateway section for the new unified reverse proxy? Is it possible then to make this a seamless upgrade for paying customers with a migration path for OSS users? Gateway is potentially a sensible name for the reverse proxy in either case IMO, and it at least prevents some migration headache for some users. Terrible idea?

Logiraptor · Oct 11 '22 02:10

i'll give this a try and update here

dimitarvdimitrov · Oct 11 '22 07:10

Surprisingly, it was easier than I thought. I've updated the PR to get rid of proxy and just use gateway.

dimitarvdimitrov · Oct 11 '22 17:10