
Preconfigured and all-in-one LGTM stack helm chart

mamiu opened this issue 3 years ago • 30 comments

Grafana has managed to offer a complete observability stack with all its different solutions. Thanks a lot for that!

However, the different applications have to be deployed individually and it isn't always easy to configure these applications so that they all work well together.

Therefore, I am currently looking for a helm chart that can be used to deploy all applications of the LGTM (Loki, Grafana, Tempo, Mimir) stack plus all other dependencies (e.g. like Prometheus, Minio, etc.) all at once. It would be great if this chart included an opinionated pre-configuration of all applications, so that the whole LGTM stack (+dependencies) works out-of-the-box.

Here is the overview of the LGTM stack on the Grafana GitHub organisation homepage:

A wallpaper showing the logos for Loki, Grafana, Tempo, and Mimir, spelling "LGTM".

[...]

LGTM

We have many projects, but we want to highlight a few, in a quite specific order:

  • Loki, like Prometheus, but for logs.
  • Grafana, the open and composable observability and data visualization platform.
  • Tempo, a high volume, minimal dependency distributed tracing backend.
  • Mimir, the most scalable Prometheus backend.

I have also looked at and tried the Grafana Agent and Grafana Agent Operator (which promise similar things), but they only deploy the Grafana Agent, not the entire LGTM stack.

The closest example I found is this docker-compose.yml in the Mimir tutorials. But it's just a docker-compose config file (rather than a helm chart) and it doesn't include Loki and Tempo (because it's just an example).

Does such a helm chart already exist or is it in the pipeline? Or do you know of any other open source repository that contains a similar helm chart?

If not, is there a reason why an all-in-one LGTM (+dependencies) stack helm chart doesn't exist and isn't planned?


Edit (Oct 4, 2023)

To quote my own comment from further down in this thread:

There's one other Grafana project that would complement this stack chart quite well: Grafana OnCall. I didn't include it when I created this issue because it wasn't available at that point.

mamiu avatar May 20 '22 03:05 mamiu

@krajorama @trevorwhitney @joe-elliott @jdbaldry Is this the correct repository to ask that kind of question or is there a better repo for this issue?

BTW: After browsing through the issues in this repo I realized that a lot of them would be resolved immediately by having a preconfigured all-in-one helm chart.

mamiu avatar May 23 '22 23:05 mamiu

Hi @mamiu, thanks for creating this issue.

We do not currently have an all-in-one chart in progress but I believe you are right that it would solve a number of our outstanding issues and also provide a really nice user experience for playing with the whole suite of Grafana projects.

I have raised this in our internal working group and will get you a comprehensive answer to the rest of your questions.

jdbaldry avatar May 24 '22 08:05 jdbaldry

@jdbaldry Thanks a lot for taking on this issue and for raising it in Grafana's internal working group.

mamiu avatar May 24 '22 18:05 mamiu

I'm back with a small update.

I've had a brief chat with some of the members of our Helm working group and we all agree this would be a valuable addition. The primary reason we have not started this yet is that we are keen to dogfood our work to ensure that it works properly, but our internal architecture differs enough that we wouldn't be able to easily integrate the all-in-one chart for running Grafana Cloud services.

The working group will definitely keep this on the agenda.

jdbaldry avatar May 25 '22 09:05 jdbaldry

Thanks for the update @jdbaldry.

That'd be amazing! If a separate Kubernetes cluster (for development and testing purposes) would be beneficial for you guys, I'm happy to sponsor a three-node cluster for this (each node has 6 vCPU cores, 16 GB RAM, 100 GB NVMe storage, 1 Gbit/s bandwidth). I can give you either SSH access to the nodes, or if you don't want to manage the Kubernetes installation yourself, I can do this for you and give you access to a newly set up cluster. No one else will use it (or get access to it) while you are working on it. Would that be beneficial for you to build a generic all-in-one chart?

mamiu avatar May 26 '22 01:05 mamiu

Any updates on this? I would like to PoC the stack and it would be very easy with an all-encompassing chart.

elocke avatar Sep 23 '22 15:09 elocke

There is no current plan to work on this. If a community member would like to build this and PR it to this repo, I would be happy to take a look. I think to be most successful it should use the grafana/mimir-distributed, grafana/loki, and grafana/tempo-distributed charts, as those are our most actively maintained charts for each database. I would also recommend adding grafana/agent-operator, as both grafana/mimir-distributed and grafana/loki have agent operator integrations.

trevorwhitney avatar Sep 23 '22 23:09 trevorwhitney

I took a quick stab at this just to get a feeling for what needs to be done, so far I've found:

You can combine the following helm charts

  • grafana/mimir-distributed
  • grafana/loki-distributed
  • grafana/tempo-distributed
  • grafana/grafana

The first thing I noticed is the naming: you might want to set fullnameOverride to change loki-distributed to loki, for example.
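
A rough sketch of what that umbrella chart could look like (the chart name, versions, and value nesting below are placeholders, not a tested configuration):

  # Chart.yaml of a hypothetical umbrella chart
  apiVersion: v2
  name: lgtm-stack
  version: 0.1.0
  dependencies:
    - name: mimir-distributed
      version: 3.x.x            # illustrative; pin to a real release
      repository: https://grafana.github.io/helm-charts
    - name: loki-distributed
      version: 0.x.x
      repository: https://grafana.github.io/helm-charts
    - name: tempo-distributed
      version: 1.x.x
      repository: https://grafana.github.io/helm-charts
    - name: grafana
      version: 6.x.x
      repository: https://grafana.github.io/helm-charts

  # values.yaml of the umbrella chart: sub-chart values are nested
  # under the dependency name, e.g. to fix the naming mentioned above
  loki-distributed:
    fullnameOverride: loki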

Out of the box the grafana helm chart doesn't have datasources for mimir/loki/tempo, so including those in the top level chart that combines them would be nice. I think the loki stack chart has some already, so they could be reused.

Both Mimir and Tempo deploy their own Minio, ideally deploying minio would be set to false for those two, then include minio in the top level LGTM stack chart.
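
Sketching that in umbrella-chart values (this assumes each sub-chart exposes a minio.enabled toggle as described above, and that the umbrella chart declares its own minio dependency; none of this is verified):

  mimir-distributed:
    minio:
      enabled: false            # don't deploy Mimir's bundled MinIO
  tempo-distributed:
    minio:
      enabled: false            # don't deploy Tempo's bundled MinIO
  minio:
    enabled: true               # one shared MinIO for the whole stack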

Mimir deploys in multi-tenant mode by default, so that either needs to be disabled, or the data source needs to include the X-Scope-OrgID header.
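
For the last two points, a hedged sketch of grafana sub-chart values that provisions all three data sources and forwards a tenant header to Mimir (the service URLs and the tenant ID are placeholders that depend on your release name and any fullnameOverride):

  grafana:
    datasources:
      datasources.yaml:
        apiVersion: 1
        datasources:
          - name: Loki
            type: loki
            url: http://loki-gateway
          - name: Tempo
            type: tempo
            url: http://tempo-query-frontend:3100
          - name: Mimir
            type: prometheus
            url: http://mimir-nginx/prometheus
            isDefault: true
            jsonData:
              httpHeaderName1: X-Scope-OrgID   # tenant header for multi-tenant Mimir
            secureJsonData:
              httpHeaderValue1: demo           # placeholder tenant ID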

I'll add more as I learn more

edude03 avatar Nov 22 '22 17:11 edude03

How far did you get with this @edude03? Working on this anywhere we can collab?

matt-psaltis avatar Jan 25 '23 09:01 matt-psaltis

Any updates on this? :)

adnanQB avatar Apr 26 '23 08:04 adnanQB

I've been playing with the LGTM stack on my personal cluster, the current code could be of interest for this issue.

Things to note though:

  • I'm using kustomize to combine helm charts and add ad-hoc manifests, but this should be straightforward to adapt into a new chart
  • The code is a big mess while I'm hacking on it
  • For now, things are tailored for a dev deployment. I think a unified chart would only make sense for such deployments though, as a production setup would require investing time in the specific products and tailoring them as needed.
  • I'm using community dashboards instead of grafana cloud, so some relabels are in place to make the output look like a kube-prometheus stack for compatibility.
  • As of now I've been using the grafana agent operator, but with the 0.33 release I'm eyeing replacing it with flow mode instead

If there is interest, I'd be very happy to contribute to this problem!

Chewie avatar Apr 26 '23 12:04 Chewie

@Chewie this is really helpful, thank you so much for sharing it.

nronnei avatar Jul 10 '23 17:07 nronnei

(Quoting @Chewie's comment above:) "I've been playing with the LGTM stack on my personal cluster, the current code could be of interest for this issue. […] If there is interest, I'd be very happy to contribute to this problem!"

Cool! I think you would love helmfile. It supports secrets (sops, vault, etc), kustomize "as charts", patches, etc.
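
For example, a minimal helmfile.yaml for part of this stack might look like the following (release names, the namespace, and the referenced values files are placeholders):

  repositories:
    - name: grafana
      url: https://grafana.github.io/helm-charts

  releases:
    - name: mimir
      namespace: monitoring
      chart: grafana/mimir-distributed
      values:
        - mimir-values.yaml     # hypothetical local values file
    - name: loki
      namespace: monitoring
      chart: grafana/loki-distributed
      values:
        - loki-values.yaml
    - name: grafana
      namespace: monitoring
      chart: grafana/grafana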

lucasfcnunes avatar Jul 28 '23 01:07 lucasfcnunes

I have started working on an LGTM stack helm chart for my own project, happy to raise a PR in this repo if that would be helpful as a starting point

timberhill avatar Sep 20 '23 10:09 timberhill

The PR is now merged and the distributed chart should be available. Might have a look at making another one in a bit using the non-distributed charts for simpler use cases.

UPD: Loki and Tempo have non-distributed versions of their charts, but Mimir by its nature runs as a set of microservices. If we want the simplest deployment, with fewer pods than the distributed charts generate, we could replace Mimir with Prometheus, but then we couldn't call it LGTM. If anyone has use cases out there, let me know.

timberhill avatar Oct 04 '23 07:10 timberhill

@timberhill Thanks a lot for creating this LGTM stack helm chart!

There's one other Grafana project that would complement this stack chart quite well: Grafana OnCall. I didn't include it when I created this issue because it wasn't available at that point.

If you have the time and also think it would be useful for most people, it would be great if you could include it in your chart (even if it's disabled by default).

mamiu avatar Oct 04 '23 16:10 mamiu

Great work @timberhill. I was looking for something like this which would allow me to install the lgtm stack as seamlessly as possible.

Didn't find a spot-on channel on Slack so I'm asking here:

Is this supposed to run standalone, or do I need to install something else (like Prometheus, grafana-agent, etc.) to start ingesting logs and metrics? When trying the chart I can see that it provisions the grafana-agent-operator, so my first thought was that agent pods would start popping up on the nodes in the cluster, but they don't, and I can't access any logs in Grafana.

winterrobert avatar Oct 05 '23 14:10 winterrobert

@mamiu yeah could do, should be fairly easy.

@winterrobert both mimir-distributed and tempo-distributed install the grafana agent operator for self-monitoring by default. Personally I would prefer to keep the stack lean and only provide endpoints for data ingestion without any data collection, but I will need to dig deeper into why those charts are set up the way they are before doing so. As it stands, this chart uses the defaults set by Grafana.

timberhill avatar Oct 05 '23 15:10 timberhill

Thanks @timberhill.

When trying to get the grafana-agent to work with the lgtm-distributed chart, what should the URLs be for my LogsInstance and MetricsInstance to ship logs and metrics to Loki and Mimir?

apiVersion: monitoring.grafana.com/v1alpha1
kind: LogsInstance
metadata:
  name: primary
  labels:
    agent: grafana-agent-logs
spec:
  clients:
  - url: "lgtm-distributed-loki-ingester.lgtm-distributed.svc.cluster.local:3100/api/v1/push?"
.....

apiVersion: monitoring.grafana.com/v1alpha1
kind: MetricsInstance
metadata:
  name: primary
  labels:
    agent: grafana-agent-metrics
spec:
  remoteWrite:
  - url: "http://lgtm-distributed-mimir-nginx.lgtm-distributed.svc.cluster.local/api/v1/push?"
....

winterrobert avatar Oct 06 '23 14:10 winterrobert

Good question, the docs don't really cover this very well. From what I understand, it's the distributor that receives the metrics, so you should use the -mimir-ingester-headless service for metrics. For Loki it must be the -loki-distributor service.

Let me know about your tests there, that would be very much appreciated - I haven't used Grafana Agent before and need to figure it out first :)
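
Whichever service turns out to be the right one, the documented push paths are /loki/api/v1/push for Loki and /api/v1/push for Mimir remote write. Going through the same gateway/nginx services the chart's data sources use, the URLs would look roughly like this (assuming a release called lgtm-distributed in the lgtm-distributed namespace; the exact service names depend on the chart defaults):

  # LogsInstance
  clients:
    - url: "http://lgtm-distributed-loki-gateway.lgtm-distributed.svc.cluster.local/loki/api/v1/push"

  # MetricsInstance
  remoteWrite:
    - url: "http://lgtm-distributed-mimir-nginx.lgtm-distributed.svc.cluster.local/api/v1/push"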

timberhill avatar Oct 07 '23 19:10 timberhill

I noticed that in the mimir-distributed chart, the nginx usage is deprecated in favour of using the gateway (https://github.com/grafana/mimir/blob/main/operations/helm/charts/mimir-distributed/values.yaml#L2123-L2128).

But the grafana chart has the nginx service name hardcoded (https://github.com/grafana/helm-charts/blob/main/charts/lgtm-distributed/values.yaml#L22).

Would it be worth (or even possible) to make the grafana chart use the correct service name (http://{{ .Release.Name }}-mimir-nginx/prometheus if nginx.enabled is true, or http://{{ .Release.Name }}-mimir-gateway/prometheus if gateway.enabledNonEnterprise is true)?

Alternatively, or in the meantime (as noted in the migration documentation), we can add a "legacy" service that always names itself {{ .Release.Name }}-mimir-nginx even if the gateway is used.

I see that this is possible by using the nameOverride property (via https://grafana.com/docs/helm-charts/mimir-distributed/latest/migration-guides/migrate-to-unified-proxy-deployment/):

  # Migrate to gateway
  gateway:
    enabledNonEnterprise: true
    service:
      nameOverride: mimir-nginx
  nginx:
    enabled: false
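
Either way, a user-side stopgap is to enable the gateway and point the bundled Grafana at it explicitly by overriding the data source URL. A rough sketch (the exact gateway service name may differ):

  mimir:
    gateway:
      enabledNonEnterprise: true
    nginx:
      enabled: false
  grafana:
    datasources:
      datasources.yaml:
        apiVersion: 1
        datasources:
          - name: Mimir
            uid: prom
            type: prometheus
            url: http://{{ .Release.Name }}-mimir-gateway/prometheus
            isDefault: true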

jasperroel avatar Oct 27 '23 13:10 jasperroel

Would love to see a solid way of making Grafana Agent in flow mode part of this chart. Happy to contribute if others think this would make sense to do.

Pionerd avatar Nov 09 '23 17:11 Pionerd

I've been attempting to use the lgtm-distributed chart to get everything running on AKS. The installation goes fine and everything is up and running. Where I'm having trouble is getting an ingress configured: no matter what I've tried, I haven't been able to get external access working. Admittedly, I'm barely proficient with Kubernetes and Helm, but since this chart is designed to work out of the box, it would be helpful if the readme included examples of configuring external access.

tjdavis3 avatar Nov 30 '23 15:11 tjdavis3

@tjdavis3 Hi, could you share the steps you followed to deploy the lgtm-distributed helm chart in Kubernetes?

Nikhil-Devisetti avatar Apr 03 '24 06:04 Nikhil-Devisetti

@Nikhil-Devisetti I created a values file from https://github.com/grafana/helm-charts/blob/main/charts/lgtm-distributed/values.yaml and installed by calling helm with -f values.yml. My values file configures Azure Blob Storage as the backend and uses OAuth for authentication. Here's what a sanitized version looks like (I removed our domains, access keys, etc.).

---
minio:
  enabled: false
grafana:
  # -- Deploy Grafana if enabled. See [upstream readme](https://github.com/grafana/helm-charts/tree/main/charts/grafana#configuration) for full values reference.
  enabled: true

  image:
    tag: 10.2.2

  ingress:
    enabled: true
    ingressClassName: nginx
    # Values can be templated
    annotations:
        cert-manager.io/cluster-issuer: letsencrypt
        acme.cert-manager.io/http01-edit-in-place: "true"
    labels: {}
    path: /

    # pathType is only for k8s >= 1.19
    pathType: Prefix

    hosts:
      - lgtm.example.com
    tls:
      - secretName: grafana-tls
        hosts:
          - lgtm.example.com
  grafana.ini:
    server:
      root_url: https://lgtm.example.com

    auth.generic_oauth:
      enabled: true
      name: RingSquared
      allow_sign_up: true
      client_id: LGTM
      client_secret: ........-....-....-....-..........
      scopes: openid email profile offline_access roles
      #  - openid
      #  - email
      #  - profile
      #  - offline_access
      #  - roles
      email_attribute_path: email
      login_attribute_path: preferred_username
      name_attribute_path: full_name
      auth_url: https://auth.example.com/auth/realms/RealmName/protocol/openid-connect/auth
      token_url: https://auth.example.com/auth/realms/RealmName/protocol/openid-connect/token
      api_url: https://auth.example.com/auth/realms/RealmName/protocol/openid-connect/userinfo
      role_attribute_path: contains(realm_access.roles[*], 'grafanaadmin') && 'GrafanaAdmin' || contains(realm_access.roles[*], 'admin') && 'Admin' || contains(realm_access.roles[*], 'grafana-editor') && 'Editor' || 'Viewer'
      allow_assign_grafana_admin: true
      auto_assign_org_role: Editor


  ldap:
    enabled: false
    # `config` is the content of `ldap.toml` that will be stored in the created secret
    config: |-
      [[servers]]
      host = "10.1.203.90 10.5.203.90"
      port = 636
      use_ssl = true
      start_tls = false
      ssl_skip_verify = true

      bind_dn = "example\\%s"

      search_filter = "(sAMAccountName=%s)"

      search_base_dns = ["dc=example,dc=com"]



      [servers.attributes]
      name = "givenName"
      surname = "sn"
      username = "cn"
      member_of = "memberOf"
      email =  "mail"

      [[servers.group_mappings]]
      group_dn = "cn=GrafanaAdmins,ou=Security Groups,ou=example,dc=example,dc=com"
      org_role = "Admin"

      [[servers.group_mappings]]
      group_dn = "cn=GrafanaEditors,ou=Security Groups,ou=example,dc=example,dc=com"
      org_role = "Editor"

      [[servers.group_mappings]]
      group_dn = "*"
      org_role = "Viewer"

  persistence:
    type: pvc
    enabled: true

  plugins:
    - grafana-oncall-app

  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
      - name: 'default'
        orgId: 1
        folder: ''
        type: file
        disableDeletion: false
        editable: true
        options:
          path: /var/lib/grafana/dashboards/default
  dashboards:
    default:
      node-exporter-full:
        gnetId: 1860
        revision: 33
        datasource: Mimir
        allowUpdates: true
      postgresql:
        gnetId: 9628
        revision: 7
        datasource: Mimir
        allowUpdates: true
      blackbox:
        gnetId: 14928
        revision: 6
        datasource: Mimir
        allowUpdates: true

  # -- Grafana data sources config. Connects to all three by default
  datasources:
    datasources.yaml:
      apiVersion: 1
      # -- Datasources linked to the Grafana instance. Override if you disable any components.
      datasources:
        # https://grafana.com/docs/grafana/latest/datasources/loki/#provision-the-loki-data-source
        - name: Loki
          uid: loki
          type: loki
          url: http://{{ .Release.Name }}-loki-gateway
          isDefault: false
        # https://grafana.com/docs/grafana/latest/datasources/prometheus/#provision-the-data-source
        - name: Mimir
          uid: prom
          type: prometheus
          url: http://{{ .Release.Name }}-mimir-nginx/prometheus
          isDefault: true
        # https://grafana.com/docs/grafana/latest/datasources/tempo/configure-tempo-data-source/#provision-the-data-source
#        - name: Tempo
#          uid: tempo
#          type: tempo
#          url: http://{{ .Release.Name }}-tempo-query-frontend:3100
#          isDefault: false
#          jsonData:
#            tracesToLogsV2:
#              datasourceUid: loki
#            lokiSearch:
#              datasourceUid: loki
#            tracesToMetrics:
#              datasourceUid: prom
#            serviceMap:
#              datasourceUid: prom

loki:
  # -- Deploy Loki if enabled. See [upstream readme](https://github.com/grafana/helm-charts/tree/main/charts/loki-distributed#values) for full values reference.
  enabled: true
  ingress:
    enabled: false
  gateway:
    enabled: true
    ingress:
      enabled: true
      ingressClassName: nginx
      annotations:
          cert-manager.io/cluster-issuer: letsencrypt
          acme.cert-manager.io/http01-edit-in-place: "true"
      hosts:
        - host: loki.example.com
          paths:
            - path: /
              pathType: Prefix
        - host: logs.example.com
          paths:
            - path: /
              pathType: Prefix
      tls:
        - secretName: loki-tls
          hosts:
            - loki.example.com
            - logs.example.com

  indexGateway:
    enabled: true
    persistence:
      enabled: true

  loki:
    structuredConfig:
      common:
        path_prefix: /var/loki
        replication_factor: 3
        storage:
          azure:
            account_name: acname
            account_key: "account_key_value"
            container_name: logs
            request_timeout: 0
      limits_config:
        enforce_metric_name: false
        max_cache_freshness_per_query: 10m
        reject_old_samples: true
        reject_old_samples_max_age: 168h
        split_queries_by_interval: 15m
      storage_config:
        azure:
          account_name: acname
          account_key: "account_key_value"
          container_name: logs
          request_timeout: 0
        boltdb_shipper:
          active_index_directory: /var/loki/boltdb-shipper-active
          cache_location: /var/loki/boltdb-shipper-cache
          cache_ttl: 24h
          shared_store: azure


# -- Mimir chart values. Resources are set to a minimum by default.
mimir:
  # -- Deploy Mimir if enabled. See [upstream values.yaml](https://github.com/grafana/mimir/blob/main/operations/helm/charts/mimir-distributed/values.yaml) for full values reference.
  enabled: true
  nginx:
    ingress:
      # -- Specifies whether an ingress for the nginx should be created
      enabled: true
      # -- Ingress Class Name. MAY be required for Kubernetes versions >= 1.18
      ingressClassName: nginx
      # -- Annotations for the nginx ingress
      annotations:
        cert-manager.io/cluster-issuer: letsencrypt
        acme.cert-manager.io/http01-edit-in-place: "true"
      # -- Hosts configuration for the nginx ingress
      hosts:
        - host: metrics.example.com
          paths:
            - path: /
              pathType: Prefix
        - host: mimir.example.com
          paths:
            - path: /
              pathType: Prefix
      # -- TLS configuration for the nginx ingress
      tls:
        - secretName: mimir-tls
          hosts:
            - metrics.example.com
            - mimir.example.com
  alertmanager:
    resources:
      requests:
        cpu: 20m
  compactor:
    resources:
      requests:
        cpu: 20m
  distributor:
    resources:
      requests:
        cpu: 20m
  ingester:
    replicas: 2
    zoneAwareReplication:
      enabled: false
    resources:
      requests:
        cpu: 20m
    persistentVolume:
      size: 20Gi
  overrides_exporter:
    resources:
      requests:
        cpu: 20m
  querier:
    replicas: 1
    resources:
      requests:
        cpu: 20m
  query_frontend:
    resources:
      requests:
        cpu: 20m
  query_scheduler:
    replicas: 1
    resources:
      requests:
        cpu: 20m
  ruler:
    resources:
      requests:
        cpu: 20m
  minio:
    enabled: false

  store_gateway:
    zoneAwareReplication:
      enabled: false
    resources:
      requests:
        cpu: 20m

  rollout_operator:
    resources:
      requests:
        cpu: 20m
  mimir:
    structuredConfig:
      blocks_storage:
        backend: azure
        azure:
          account_name: acname
          account_key: "account_key_value"
          container_name: "metrics"
          endpoint_suffix: "blob.core.windows.net"
          max_retries: 20
        tsdb:
          dir: /data/ingester

      ruler:
        rule_path: /data/ruler
        alertmanager_url: http://127.0.0.1:8080/alertmanager
        ring:
          # Quickly detect unhealthy rulers to speed up the tutorial.
          heartbeat_period: 2s
          heartbeat_timeout: 10s

      ruler_storage:
        backend: azure
        azure:
          account_name: acname
          account_key: "account_key_value"
          container_name: "mimir-ruler"
          endpoint_suffix: "blob.core.windows.net"
          max_retries: 20

      alertmanager:
        data_dir: /data/alertmanager
        #fallback_config_file: /etc/alertmanager-fallback-config.yaml
        external_url: https://mimir.example.com/alertmanager

      alertmanager_storage:
        backend: azure
        azure:
          account_name: acname
          account_key: "account_key_value"
          container_name: "mimir-alertmanager"
          endpoint_suffix: "blob.core.windows.net"
          max_retries: 20

      server:
        log_level: warn

      limits:
        max_global_series_per_user: 1350000
        ingestion_rate: 220000
        ingestion_burst_size: 2400000
        max_label_names_per_series: 50


tempo:
  # -- Deploy Tempo if enabled.  See [upstream readme](https://github.com/grafana/helm-charts/blob/main/charts/tempo-distributed/README.md#values) for full values reference.
  enabled: false
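
For reference, a minimal install invocation for a values file like this would be roughly the following (release name and namespace are up to you):

  helm repo add grafana https://grafana.github.io/helm-charts
  helm repo update
  helm install lgtm-distributed grafana/lgtm-distributed \
    --namespace lgtm-distributed --create-namespace \
    -f values.yml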

tjdavis3 avatar Apr 04 '24 16:04 tjdavis3

@tjdavis3 Thanks for sharing the values. Appreciate it

Nikhil-Devisetti avatar Apr 08 '24 10:04 Nikhil-Devisetti

(Quoting @jasperroel's comment above:) "I noticed that in the mimir-distributed chart, the nginx usage is deprecated in favour of using the gateway […]"

I am not sure if I am hitting the same thing; I just happened to see this after opening an issue here.

govindkailas avatar Jun 26 '24 22:06 govindkailas