mimir Helm: replace Grafana Agent Operator with Grafana Alloy

Is your feature request related to a problem? Please describe.

Grafana Agent Operator helm chart ~deprecated~ entered long-term support mode on 9 April 2024 in favor of the new ~Grafana Agent helm chart (with flow)~ Grafana Alloy.

Grafana Agent Operator relied on Prometheus CRDs for monitoring, which had to be installed as an extra step.

Describe the solution you'd like

Use the new ~Grafana Agent helm chart~ Grafana Alloy as a subchart of the mimir-distributed helm chart.
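A rough sketch of what the subchart wiring could look like in the chart's `Chart.yaml`. The `condition` key and the version range are placeholders assumed for illustration, not decisions made in this issue:

```yaml
# Chart.yaml fragment for mimir-distributed (sketch only, not the actual chart).
dependencies:
  - name: alloy                                   # Grafana Alloy chart
    alias: alloy
    repository: https://grafana.github.io/helm-charts
    version: "*"                                  # placeholder range; pin to a tested release
    condition: metaMonitoring.alloy.enabled       # hypothetical values key, assumed for illustration
```

Gating the dependency behind a `condition` would keep it opt-in, which is one way to limit the impact on existing installations.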

~Use Agent Flow mode instead of static to be future proof.~

Get rid of ServiceMonitor CRDs in favor of service annotations. (Avoid pod annotations.)
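For illustration, an annotated Service might look like the sketch below. The `prometheus.io/*` annotations are a widely used convention rather than something Kubernetes or the collector understands natively, so the collector configuration would still need discovery and relabel rules that honor them. The Service name, selector, port, and port name here are assumptions:

```yaml
# Sketch of a Service carrying scrape annotations instead of a ServiceMonitor.
apiVersion: v1
kind: Service
metadata:
  name: mimir-ingester                  # hypothetical name
  annotations:
    prometheus.io/scrape: "true"        # opt this Service in to scraping
    prometheus.io/port: "8080"          # assumed metrics port
    prometheus.io/path: "/metrics"      # assumed metrics path
spec:
  selector:
    app.kubernetes.io/component: ingester   # assumed selector label
  ports:
    - name: http-metrics                # assumed port name
      port: 8080
      targetPort: http-metrics
```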

Keep values.yaml intact and backwards compatible if possible.

Describe alternatives you've considered

Removing meta-monitoring altogether from the mimir distributed chart. This would make it harder to demo. Also with CRDs all but gone, the new solution should be easier to use.

Additional context


### Tasks
- [ ] https://github.com/grafana/mimir/issues/6613

Aug 29 '23 13:08 krajorama

Lately we learned of the https://github.com/grafana/k8s-monitoring-helm helm chart, which might be an even better candidate for a subchart / standalone solution.

Oct 05 '23 13:10 krajorama

@krajorama how much time do you imagine this will take? from my understanding, we just need to replace the operator one with the flow mode in our helm chart but I don't know how much this will impact our customers

Oct 06 '23 19:10 zhehao-grafana

@zhehao-grafana I would say about 2 months. Need some discovery to look into the two options and play around with them. Then implementation, testing and documentation. Ideally it doesn't impact the customer and we can make it backwards compatible, but I doubt it a little bit.

Oct 25 '23 07:10 krajorama

SMEs for this issue going forward are @lamida and @francoposa.

Nov 16 '23 13:11 osg-grafana

notes from a Jan 11th sync with Krajo, in case this gets handed off:

Main requirement: not use classic, but use new flow configuration

compatibility: ideally the series & labels produced must be the same as now in order to be compatible with our own dashboards (and everyone else's). otherwise we could introduce some configuration to transform some metrics as a stopgap, or just make it a breaking change? - least ideal

primary use case for metamonitoring: we want community and enterprise customers to send metrics to grafana cloud. secondary use case: send it somewhere else or back into Mimir

look at two helm charts (agent chart vs. kubernetes-monitoring-helm chart) - which one is better / more maintained / easier in the long term. Ask maintainers about the future ideas of maintaining the charts and keeping them integrated.

Feb 27 '24 01:02 francoposa

do you guys think it's possible to achieve this without a breaking change in the helm configuration too? I think that would be the best and won't require any migration. Still, requiring users to migrate the helm values is better IMO than having to migrate dashboards and alerts if we change the "series & labels produced"

Feb 27 '24 13:02 dimitarvdimitrov

> do you guys think it's possible to achieve this without a breaking change in the helm configuration too?

depends what exactly we mean. there will be a different subchart used no matter what, which is an unavoidable change - I am not sure how much a user may experience that.

Ideally the values file would not change but if we end up having to apply any sort of shim or patch to the chart to avoid changes to the values file, I think there will be an argument for just making a breaking change and writing a migration guide (possibly with a deprecation period in between).

But we don't know what changes may be needed until we actually dig in

Feb 27 '24 19:02 francoposa

Now that alloy has replaced grafana agent flow, should this issue be renamed or replaced?

Apr 16 '24 06:04 ron1

@dimitarvdimitrov Thanks for renaming the issue.

May 02 '24 08:05 ron1

Here's a preliminary list of documentation topics that will need to be updated with the new chart:

docs/sources/helm-charts/mimir-distributed/get-started-helm-charts/_index.md
docs/sources/helm-charts/mimir-distributed/run-production-environment-with-helm/_index.md
docs/sources/helm-charts/mimir-distributed/run-production-environment-with-helm/monitor-system-health.md
docs/sources/mimir/manage/monitor-grafana-mimir/collecting-metrics-and-logs.md
docs/sources/mimir/manage/monitor-grafana-mimir/requirements.md

Jul 12 '24 16:07 tacole02