Preconfigured and all-in-one LGTM stack helm chart
Grafana has managed to offer a complete observability stack across its different solutions. Thanks a lot for that!
However, the different applications have to be deployed individually and it isn't always easy to configure these applications so that they all work well together.
Therefore, I am currently looking for a helm chart that deploys all applications of the LGTM (Loki, Grafana, Tempo, Mimir) stack plus all other dependencies (e.g. Prometheus, Minio) at once. It would be great if this chart included an opinionated pre-configuration of all applications, so that the whole LGTM stack (plus dependencies) works out of the box.
Here is the overview of the LGTM stack on the Grafana GitHub organisation homepage:
[...]
LGTM
We have many projects, but we want to highlight a few, in a quite specific order:
- Loki, like Prometheus, but for logs.
- Grafana, the open and composable observability and data visualization platform.
- Tempo, a high volume, minimal dependency distributed tracing backend.
- Mimir, the most scalable Prometheus backend.
I have also looked at and tried the Grafana Agent and the Grafana Agent Operator (which promise similar things), but they only deploy the Grafana Agent, not the entire LGTM stack.
The closest example I found is this docker-compose.yml in the Mimir tutorials. But it's just a docker-compose config file (rather than a helm chart) and it doesn't include Loki and Tempo (because it's just an example).
Does such a helm chart already exist or is it in the pipeline? Or do you know of any other open source repository that contains a similar helm chart?
If not, is there a reason why an all-in-one LGTM (+dependencies) stack helm chart doesn't exist and isn't planned?
Edit (Oct 4, 2023)
To quote my own comment from further down this thread:
There's one other Grafana project that would complement this stack chart quite well: Grafana OnCall. I didn't include it when I created this issue because it wasn't available at that point.
@krajorama @trevorwhitney @joe-elliott @jdbaldry Is this the correct repository to ask that kind of question or is there a better repo for this issue?
BTW: After browsing through the issues in this repo I realized that a lot of them would be resolved immediately by having a preconfigured all-in-one helm chart.
Hi @mamiu, thanks for creating this issue.
We do not currently have an all-in-one chart in progress but I believe you are right that it would solve a number of our outstanding issues and also provide a really nice user experience for playing with the whole suite of Grafana projects.
I have raised this in our internal working group and will get you a comprehensive answer to the rest of your questions.
@jdbaldry Thanks a lot for taking on this issue and for raising it in Grafana's internal working group.
I'm back with a small update.
I've had a brief chat with some of the members of our Helm working group, and we all agree this would be a valuable addition. The primary reason we have not started this yet is that we are keen to dogfood our work to ensure that it works properly, and our internal architecture differs enough that we wouldn't be able to easily use the all-in-one chart for running Grafana Cloud services.
The working group will definitely keep this on the agenda.
Thanks for the update @jdbaldry.
That'd be amazing! If a separate Kubernetes cluster (for development and testing purposes) would be beneficial for you guys, I'm happy to sponsor a three-node cluster for this (each node with 6 vCPU cores, 16 GB RAM, 100 GB NVMe storage, 1 Gbit/s bandwidth). I can either give you SSH access to the nodes, or, if you don't want to manage the Kubernetes installation yourself, I can do that for you and give you access to a newly set up cluster. No one else will use it (or get access to it) while you're working on it. Would that be beneficial for building a generic all-in-one chart?
Any updates on this? I would like to PoC the stack, and it would be very easy with an all-encompassing chart.
There is no current plan to work on this. If a community member would like to build this and PR it to this repo I would be happy to take a look. I think to be most successful it should use grafana/mimir-distributed source, grafana/loki source, and grafana/tempo-distributed source, as those are our most actively maintained charts for each database. I would also recommend the addition of grafana/agent-operator, as both grafana/mimir-distributed and grafana/loki have agent operator integrations.
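For anyone who wants to attempt this, here is a minimal sketch of what such an umbrella chart's Chart.yaml could look like, declaring the recommended charts as dependencies. The chart name, version constraints, and condition keys below are illustrative assumptions, not an existing chart:

```yaml
# Hypothetical umbrella Chart.yaml; pin the "*" constraints to real releases
# from https://grafana.github.io/helm-charts before actual use.
apiVersion: v2
name: lgtm-stack
description: Opinionated all-in-one LGTM stack (sketch)
version: 0.1.0
dependencies:
  - name: mimir-distributed
    repository: https://grafana.github.io/helm-charts
    version: "*"
    condition: mimir.enabled
  - name: loki
    repository: https://grafana.github.io/helm-charts
    version: "*"
    condition: loki.enabled
  - name: tempo-distributed
    repository: https://grafana.github.io/helm-charts
    version: "*"
    condition: tempo.enabled
  - name: grafana-agent-operator
    repository: https://grafana.github.io/helm-charts
    version: "*"
    condition: agent.enabled
  - name: grafana
    repository: https://grafana.github.io/helm-charts
    version: "*"
    condition: grafana.enabled
```

`helm dependency update` would then pull all five charts so the whole stack can be installed with a single `helm install`.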
I took a quick stab at this just to get a feeling for what needs to be done, so far I've found:
You can combine the following helm charts:
- grafana/mimir-distributed
- grafana/loki-distributed
- grafana/tempo-distributed
- grafana/grafana
The first thing I noticed is the naming: you might want to set a fullnameOverride to change loki-distributed to loki, for example.
Out of the box, the grafana helm chart doesn't have datasources for mimir/loki/tempo, so including those in the top-level chart that combines them would be nice. I think the loki-stack chart already has some, so they could be reused.
Both Mimir and Tempo deploy their own Minio; ideally, minio would be set to false for those two and included once in the top-level LGTM stack chart.
Mimir deploys in multi-tenant mode by default, so that either needs to be disabled, or the data source needs to include the X-Scope-OrgID header (see the sketch below).
I'll add more as I learn more
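To make the notes above concrete, here is a rough values.yaml sketch for such a combined chart. The sub-chart key names are assumptions based on each chart's documented values and may differ between versions; the X-Scope-OrgID header is wired in through Grafana's standard custom-header datasource fields:

```yaml
# Sketch of umbrella-chart values; sub-chart key names assumed, verify upstream.
loki-distributed:
  fullnameOverride: loki   # shortens generated names from "loki-distributed" to "loki"
mimir-distributed:
  minio:
    enabled: false         # disable the bundled Minio...
tempo-distributed:
  minio:
    enabled: false         # ...and run a single shared Minio at the top level
minio:
  enabled: true
grafana:
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
        - name: Mimir
          type: prometheus
          url: http://mimir-nginx/prometheus   # service name depends on your release
          isDefault: true
          jsonData:
            httpHeaderName1: X-Scope-OrgID     # needed while Mimir multi-tenancy is on
          secureJsonData:
            httpHeaderValue1: demo             # placeholder tenant ID
```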
How far did you get with this @edude03? Working on this anywhere we can collab?
Any updates on this? :)
I've been playing with the LGTM stack on my personal cluster; the current code could be of interest for this issue.
Things to note though:
- I'm using kustomize to combine helm charts and add ad-hoc manifests, but this should be straightforward to adapt into a new chart
- The code is a big mess while I'm hacking on it
- For now, things are tailored for a dev deployment. I think a unified chart would only make sense for such deployments anyway, as a production setup would require investing time in the specific products and tailoring them as needed.
- I'm using community dashboards instead of grafana cloud, so some relabels are in place to make the output look like a kube-prometheus stack for compatibility.
- As of now I've been using the grafana agent operator, but with the 0.33 release I'm eyeing replacing it with Flow mode instead
If there is interest, I'd be very happy to contribute to this problem!
@Chewie this is really helpful, thank you so much for sharing it.
Cool! I think you would love helmfile. It supports secrets (sops, vault, etc.), kustomize "as charts", patches, and more.
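For reference, a minimal helmfile.yaml along those lines might look like this; the release names and namespace are illustrative, and per-release values files are omitted:

```yaml
# Illustrative helmfile.yaml that syncs the whole LGTM stack in one command
repositories:
  - name: grafana
    url: https://grafana.github.io/helm-charts

releases:
  - name: mimir
    namespace: monitoring
    chart: grafana/mimir-distributed
  - name: loki
    namespace: monitoring
    chart: grafana/loki-distributed
  - name: tempo
    namespace: monitoring
    chart: grafana/tempo-distributed
  - name: grafana
    namespace: monitoring
    chart: grafana/grafana
```

A single `helmfile sync` then installs or upgrades all four releases together.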
I have started working on an LGTM stack helm chart for my own project; happy to raise a PR in this repo if that would be helpful as a starting point
The PR is now merged and the distributed chart should be available now. Might have a look at making another one in a bit using the non-distributed charts for simpler use cases.
UPD: Loki and Tempo have non-distributed versions of their charts, but Mimir by its nature runs as a set of microservices. If we want the simplest possible deployment, with fewer pods than the distributed charts generate, we could replace Mimir with Prometheus, but we couldn't call it LGTM at that point. If anyone has use cases out there, let me know.
@timberhill Thanks a lot for creating this LGTM stack helm chart!
There's one other Grafana project that would complement this stack chart quite well: Grafana OnCall. I didn't include it when I created this issue because it wasn't available at that point.
If you have the time and also think it would be useful for most people, it would be great if you could include it in your chart (even if it's disabled by default).
Great work @timberhill. I was looking for something like this which would allow me to install the lgtm stack as seamlessly as possible.
Didn't find a spot-on channel on Slack so I'm asking here:
Is this supposed to run standalone, or do I need to install something else (like Prometheus, grafana-agent, etc.) to start ingesting logs and metrics? When trying the chart I can see that it provisions the grafana-agent-operator, so my first thought was that agent pods would start popping up on the nodes in the cluster, but they don't, and I can't access any logs in Grafana.
@mamiu yeah could do, should be fairly easy.
@winterrobert both mimir-distributed and tempo-distributed install the grafana agent operator for self-monitoring by default. Personally, I would prefer to keep the stack lean and only provide endpoints for data ingest without any data collection, but I will need to dig deeper into why those charts are set up the way they are before changing that. As it stands, this chart uses the defaults set by Grafana.
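For anyone who wants to trim that down themselves in the meantime, the upstream charts expose toggles for their self-monitoring components. A sketch, assuming the metaMonitoring keys from the current mimir-distributed and tempo-distributed values (double-check them against the chart versions actually bundled here):

```yaml
# Sketch: opt out of the bundled agent-operator-based self-monitoring.
# Key names assumed from the upstream charts' values; verify before use.
mimir:
  metaMonitoring:
    grafanaAgent:
      enabled: false
      installOperator: false
tempo:
  metaMonitoring:
    grafanaAgent:
      enabled: false
      installOperator: false
```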
Thanks @timberhill.
When trying to get the grafana-agent to work with the lgtm-distributed chart, what should the URLs be for my LogsInstance and MetricsInstance to ship logs and metrics to Loki and Mimir?
```yaml
apiVersion: monitoring.grafana.com/v1alpha1
kind: LogsInstance
metadata:
  name: primary
  labels:
    agent: grafana-agent-logs
spec:
  clients:
    - url: "lgtm-distributed-loki-ingester.lgtm-distributed.svc.cluster.local:3100/api/v1/push?"
.....
apiVersion: monitoring.grafana.com/v1alpha1
kind: MetricsInstance
metadata:
  name: primary
  labels:
    agent: grafana-agent-metrics
spec:
  remoteWrite:
    - url: "http://lgtm-distributed-mimir-nginx.lgtm-distributed.svc.cluster.local/api/v1/push?"
....
```
Good question, the docs don't really cover this well. From what I understand, it's the distributor that receives the incoming metrics, so you should use the -mimir-distributor-headless service for metrics. For Loki it must be the -loki-distributor service.
Let me know how your tests go, that would be very much appreciated - I haven't used Grafana Agent before and need to figure it out first :)
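For reference, here is a hedged sketch of what the two resources might look like with cluster-internal push endpoints. The service names assume a release called lgtm-distributed in the lgtm-distributed namespace, and the standard push paths are /loki/api/v1/push for Loki and /api/v1/push for Mimir; confirm the actual service names with kubectl get svc:

```yaml
# Sketch only; verify service names in your cluster before using.
apiVersion: monitoring.grafana.com/v1alpha1
kind: LogsInstance
metadata:
  name: primary
  labels:
    agent: grafana-agent-logs
spec:
  clients:
    # Loki's HTTP push endpoint
    - url: http://lgtm-distributed-loki-distributor.lgtm-distributed.svc.cluster.local:3100/loki/api/v1/push
---
apiVersion: monitoring.grafana.com/v1alpha1
kind: MetricsInstance
metadata:
  name: primary
  labels:
    agent: grafana-agent-metrics
spec:
  remoteWrite:
    # Mimir accepts Prometheus remote write on /api/v1/push (here via nginx)
    - url: http://lgtm-distributed-mimir-nginx.lgtm-distributed.svc.cluster.local/api/v1/push
```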
I noticed that in the mimir-distributed chart, the nginx usage is deprecated in favour of using the gateway (https://github.com/grafana/mimir/blob/main/operations/helm/charts/mimir-distributed/values.yaml#L2123-L2128).
But the grafana chart has the nginx service name hardcoded (https://github.com/grafana/helm-charts/blob/main/charts/lgtm-distributed/values.yaml#L22).
Would it be worth (/possible?) making the grafana chart use the correct service name (http://{{ .Release.Name }}-mimir-nginx/prometheus if nginx.enabled is true, or http://{{ .Release.Name }}-mimir-gateway/prometheus if gateway.enabledNonEnterprise is true)?
Alternatively, or in the meantime (as noted in the migration documentation), we could add a "legacy" service that is always named {{ .Release.Name }}-mimir-nginx even if the gateway is used.
I see that this is possible by using the nameOverride property (via https://grafana.com/docs/helm-charts/mimir-distributed/latest/migration-guides/migrate-to-unified-proxy-deployment/):
```yaml
# Migrate to gateway
gateway:
  enabledNonEnterprise: true
  service:
    nameOverride: mimir-nginx
nginx:
  enabled: false
```
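Until the chart resolves the name automatically, the datasource URL can also simply be overridden from the lgtm-distributed values when the gateway is enabled; a sketch mirroring the chart's default datasource block:

```yaml
# Sketch: point the provisioned Mimir datasource at the gateway service
grafana:
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
        - name: Mimir
          uid: prom
          type: prometheus
          url: http://{{ .Release.Name }}-mimir-gateway/prometheus
          isDefault: true
```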
Would love to see a solid way of making Grafana Agent in flow mode part of this chart. Happy to contribute if others think this would make sense to do.
I've been attempting to use the lgtm-distributed chart to get everything running on AKS. The installation goes fine and everything is up and running. Where I'm having trouble is getting an ingress configured: no matter what I've tried, I have not been successful in getting external access working. Admittedly, I'm barely proficient with Kubernetes and Helm, but since this chart is designed to work out of the box, it would be helpful if the readme included examples of configuring external access.
@tjdavis3 Hi, could you share the steps you followed to deploy the lgtm-distributed helm chart in Kubernetes?
@Nikhil-Devisetti I created a values file from https://github.com/grafana/helm-charts/blob/main/charts/lgtm-distributed/values.yaml and installed by calling helm with -f values.yml. My values file configures Azure Blob Storage as the backend and uses OAuth for authentication. Here's what a sanitized version looks like (I removed our domains, access keys, etc.):
```yaml
---
minio:
  enabled: false
grafana:
  # -- Deploy Grafana if enabled. See [upstream readme](https://github.com/grafana/helm-charts/tree/main/charts/grafana#configuration) for full values reference.
  enabled: true
  image:
    tag: 10.2.2
  ingress:
    enabled: true
    ingressClassName: nginx
    # Values can be templated
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt
      acme.cert-manager.io/http01-edit-in-place: "true"
    labels: {}
    path: /
    # pathType is only for k8s >= 1.19
    pathType: Prefix
    hosts:
      - lgtm.example.com
    tls:
      - secretName: grafana-tls
        hosts:
          - lgtm.example.com
  grafana.ini:
    server:
      root_url: https://lgtm.example.com
    auth.generic_oauth:
      enabled: true
      name: RingSquared
      allow_sign_up: true
      client_id: LGTM
      client_secret: ........-....-....-....-..........
      scopes: openid email profile offline_access roles
      # - openid
      # - email
      # - profile
      # - offline_access
      # - roles
      email_attribute_path: email
      login_attribute_path: preferred_username
      name_attribute_path: full_name
      auth_url: https://auth.example.com/auth/realms/RealmName/protocol/openid-connect/auth
      token_url: https://auth.example.com/auth/realms/RealmName/protocol/openid-connect/token
      api_url: https://auth.example.com/auth/realms/RealmName/protocol/openid-connect/userinfo
      role_attribute_path: contains(realm_access.roles[*], 'grafanaadmin') && 'GrafanaAdmin' || contains(realm_access.roles[*], 'admin') && 'Admin' || contains(realm_access.roles[*], 'grafana-editor') && 'Editor' || 'Viewer'
      allow_assign_grafana_admin: true
      auto_assign_org_role: Editor
  ldap:
    enabled: false
    # `config` is the content of `ldap.toml` that will be stored in the created secret
    config: |-
      [[servers]]
      host = "10.1.203.90 10.5.203.90"
      port = 636
      use_ssl = true
      start_tls = false
      ssl_skip_verify = true
      bind_dn = "example\\%s"
      search_filter = "(sAMAccountName=%s)"
      search_base_dns = ["dc=example,dc=com"]

      [servers.attributes]
      name = "givenName"
      surname = "sn"
      username = "cn"
      member_of = "memberOf"
      email = "mail"

      [[servers.group_mappings]]
      group_dn = "cn=GrafanaAdmins,ou=Security Groups,ou=example,dc=example,dc=com"
      org_role = "Admin"

      [[servers.group_mappings]]
      group_dn = "cn=GrafanaEditors,ou=Security Groups,ou=example,dc=example,dc=com"
      org_role = "Editor"

      [[servers.group_mappings]]
      group_dn = "*"
      org_role = "Viewer"
  persistence:
    type: pvc
    enabled: true
  plugins:
    - grafana-oncall-app
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: 'default'
          orgId: 1
          folder: ''
          type: file
          disableDeletion: false
          editable: true
          options:
            path: /var/lib/grafana/dashboards/default
  dashboards:
    default:
      node-exporter-full:
        gnetId: 1860
        revision: 33
        datasource: Mimir
        allowUpdates: true
      postgresql:
        gnetId: 9628
        revision: 7
        datasource: Mimir
        allowUpdates: true
      blackbox:
        gnetId: 14928
        revision: 6
        datasource: Mimir
        allowUpdates: true
  # -- Grafana data sources config. Connects to all three by default
  datasources:
    datasources.yaml:
      apiVersion: 1
      # -- Datasources linked to the Grafana instance. Override if you disable any components.
      datasources:
        # https://grafana.com/docs/grafana/latest/datasources/loki/#provision-the-loki-data-source
        - name: Loki
          uid: loki
          type: loki
          url: http://{{ .Release.Name }}-loki-gateway
          isDefault: false
        # https://grafana.com/docs/grafana/latest/datasources/prometheus/#provision-the-data-source
        - name: Mimir
          uid: prom
          type: prometheus
          url: http://{{ .Release.Name }}-mimir-nginx/prometheus
          isDefault: true
        # https://grafana.com/docs/grafana/latest/datasources/tempo/configure-tempo-data-source/#provision-the-data-source
        # - name: Tempo
        #   uid: tempo
        #   type: tempo
        #   url: http://{{ .Release.Name }}-tempo-query-frontend:3100
        #   isDefault: false
        #   jsonData:
        #     tracesToLogsV2:
        #       datasourceUid: loki
        #     lokiSearch:
        #       datasourceUid: loki
        #     tracesToMetrics:
        #       datasourceUid: prom
        #     serviceMap:
        #       datasourceUid: prom
loki:
  # -- Deploy Loki if enabled. See [upstream readme](https://github.com/grafana/helm-charts/tree/main/charts/loki-distributed#values) for full values reference.
  enabled: true
  ingress:
    enabled: false
  gateway:
    enabled: true
    ingress:
      enabled: true
      ingressClassName: nginx
      annotations:
        cert-manager.io/cluster-issuer: letsencrypt
        acme.cert-manager.io/http01-edit-in-place: "true"
      hosts:
        - host: loki.example.com
          paths:
            - path: /
              pathType: Prefix
        - host: logs.example.com
          paths:
            - path: /
              pathType: Prefix
      tls:
        - secretName: loki-tls
          hosts:
            - loki.example.com
            - logs.example.com
  indexGateway:
    enabled: true
    persistence:
      enabled: true
  loki:
    structuredConfig:
      common:
        path_prefix: /var/loki
        replication_factor: 3
        storage:
          azure:
            account_name: acname
            account_key: "account_key_value"
            container_name: logs
            request_timeout: 0
      limits_config:
        enforce_metric_name: false
        max_cache_freshness_per_query: 10m
        reject_old_samples: true
        reject_old_samples_max_age: 168h
        split_queries_by_interval: 15m
      storage_config:
        azure:
          account_name: acname
          account_key: "account_key_value"
          container_name: logs
          request_timeout: 0
        boltdb_shipper:
          active_index_directory: /var/loki/boltdb-shipper-active
          cache_location: /var/loki/boltdb-shipper-cache
          cache_ttl: 24h
          shared_store: azure
# -- Mimir chart values. Resources are set to a minimum by default.
mimir:
  # -- Deploy Mimir if enabled. See [upstream values.yaml](https://github.com/grafana/mimir/blob/main/operations/helm/charts/mimir-distributed/values.yaml) for full values reference.
  enabled: true
  nginx:
    ingress:
      # -- Specifies whether an ingress for the nginx should be created
      enabled: true
      # -- Ingress Class Name. MAY be required for Kubernetes versions >= 1.18
      ingressClassName: nginx
      # -- Annotations for the nginx ingress
      annotations:
        cert-manager.io/cluster-issuer: letsencrypt
        acme.cert-manager.io/http01-edit-in-place: "true"
      # -- Hosts configuration for the nginx ingress
      hosts:
        - host: metrics.example.com
          paths:
            - path: /
              pathType: Prefix
        - host: mimir.example.com
          paths:
            - path: /
              pathType: Prefix
      # -- TLS configuration for the nginx ingress
      tls:
        - secretName: mimir-tls
          hosts:
            - metrics.example.com
            - mimir.example.com
  alertmanager:
    resources:
      requests:
        cpu: 20m
  compactor:
    resources:
      requests:
        cpu: 20m
  distributor:
    resources:
      requests:
        cpu: 20m
  ingester:
    replicas: 2
    zoneAwareReplication:
      enabled: false
    resources:
      requests:
        cpu: 20m
    persistentVolume:
      size: 20Gi
  overrides_exporter:
    resources:
      requests:
        cpu: 20m
  querier:
    replicas: 1
    resources:
      requests:
        cpu: 20m
  query_frontend:
    resources:
      requests:
        cpu: 20m
  query_scheduler:
    replicas: 1
    resources:
      requests:
        cpu: 20m
  ruler:
    resources:
      requests:
        cpu: 20m
  minio:
    enabled: false
  store_gateway:
    zoneAwareReplication:
      enabled: false
    resources:
      requests:
        cpu: 20m
  rollout_operator:
    resources:
      requests:
        cpu: 20m
  mimir:
    structuredConfig:
      blocks_storage:
        backend: azure
        azure:
          account_name: acname
          account_key: "account_key_value"
          container_name: "metrics"
          endpoint_suffix: "blob.core.windows.net"
          max_retries: 20
        tsdb:
          dir: /data/ingester
      ruler:
        rule_path: /data/ruler
        alertmanager_url: http://127.0.0.1:8080/alertmanager
        ring:
          # Quickly detect unhealthy rulers to speed up the tutorial.
          heartbeat_period: 2s
          heartbeat_timeout: 10s
      ruler_storage:
        backend: azure
        azure:
          account_name: acname
          account_key: "account_key_value"
          container_name: "mimir-ruler"
          endpoint_suffix: "blob.core.windows.net"
          max_retries: 20
      alertmanager:
        data_dir: /data/alertmanager
        # fallback_config_file: /etc/alertmanager-fallback-config.yaml
        external_url: https://mimir.example.com/alertmanager
      alertmanager_storage:
        backend: azure
        azure:
          account_name: acname
          account_key: "account_key_value"
          container_name: "mimir-alertmanager"
          endpoint_suffix: "blob.core.windows.net"
          max_retries: 20
      server:
        log_level: warn
      limits:
        max_global_series_per_user: 1350000
        ingestion_rate: 220000
        ingestion_burst_size: 2400000
        max_label_names_per_series: 50
tempo:
  # -- Deploy Tempo if enabled. See [upstream readme](https://github.com/grafana/helm-charts/blob/main/charts/tempo-distributed/README.md#values) for full values reference.
  enabled: false
```
@tjdavis3 Thanks for sharing the values. Appreciate it
I'm not sure if I'm also hitting the same thing; I just happened to see this after opening an issue here.
