Improve operational confidence

Open tpaschalis opened this issue 1 year ago • 6 comments

We should build on our dogfooding experience to improve the operational confidence of users and help them run large-scale Grafana Agent deployments without worries.

### Tasks
- [ ] Add mixin dashboards for most common component namespaces 
- [ ] Ensure all mixin dashboards are consistent  
- [ ] Create opinionated set of alerts for grafana-agent-flow mixin 
- [ ] Write runbooks for mixin alerts 

tpaschalis avatar Feb 05 '24 19:02 tpaschalis

Grafana Agent is very versatile and can replace most, if not all, of our telemetry needs. Flow in particular makes it seem that you can drop all your use-cases in one river, when in fact I'm starting to feel that you should probably run at least 2 deployments (1x daemonset, 1x statefulset), as some components work better in one deployment than in the other.

I would appreciate some guidance in the docs on this (a per-component topology recommendation), so others can avoid killing multiple nodes like I did by using prometheus.operator.servicemonitors in the river with the default daemonset configuration, without host filtering and without limits.

LE: A friend pointed out k8s-monitoring-helm which is a great example of running multiple agent deployments to cover different use-cases.

agologan avatar Feb 19 '24 12:02 agologan

Hi @agologan 👋 Do you mind raising a separate issue, please? I suppose the issue should be for enhancing the Deploy doc. Please label it as type/docs so that our docs team can track it. This issue here is not so much for docs - it's more for dashboards, alerts, and runbooks.

ptodev avatar Feb 27 '24 13:02 ptodev

@tpaschalis @rfratto I'd be happy to help with this issue, but I'll need more information on what we need to improve. The issue description is quite broad. I don't know what the highest priority issues are.

ptodev avatar Feb 27 '24 13:02 ptodev

After some careful consideration, I have marked my above comment as off-topic. While not completely irrelevant, my testimony is not very actionable, and I wouldn't want to waste maintainers' time with it unless there's widespread indication that the docs need improving.

agologan avatar Feb 27 '24 16:02 agologan

> I'd be happy to help with this issue, but I'll need more information on what we need to improve. The issue description is quite broad. I don't know what the highest priority issues are.

@ptodev I've added some extra information in a task list, but it is admittedly still vague. I will be spending time soon to create a more concrete list of tasks.

rfratto avatar Feb 27 '24 19:02 rfratto

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it. If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue. The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity. Thank you for your contributions!

github-actions[bot] avatar Mar 31 '24 00:03 github-actions[bot]