grafana-image-renderer

High CPU at idle after failed image render (Docker Instance with mode:clustered)

Open lux4rd0 opened this issue 2 years ago • 4 comments

What happened:

Sending a request from Grafana to the local Docker container to generate a PDF produces an incomplete PDF with missing panels, and leaves processes behind in the container running at 100% CPU utilization. Resubmitting the PDF request gives different results (a different set of missing panels each time). Rendering the dashboards in Grafana itself is instantaneous, so this does not look like a data source response issue or a Grafana issue, but possibly a renderer-only issue.

[Screenshots: high_cpu, high_cpu_docker, image_renderer]

What you expected to happen:

Complete PDF, all threads and processes completing with no CPU usage when idle.
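
For anyone checking whether processes were left behind after a render, here is a minimal Python sketch that filters `ps`-style output for busy browser processes. It assumes the text comes from something like `ps -eo pid,pcpu,comm` run on the Docker host, that the leftover processes have "chrome" in their command name, and that 50% CPU is a sensible cutoff. The sample output is hypothetical, not taken from the attached logs.

```python
def find_busy_processes(ps_output, name_fragment="chrome", cpu_threshold=50.0):
    """Return (pid, cpu_percent, command) tuples for matching busy processes.

    Expects three whitespace-separated columns (PID, %CPU, COMMAND) with a
    header row, e.g. the output of `ps -eo pid,pcpu,comm`.
    """
    busy = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # ignore malformed or empty lines
        pid_text, cpu_text, command = parts
        try:
            pid = int(pid_text)
            cpu_percent = float(cpu_text)
        except ValueError:
            continue  # ignore rows that are not numeric
        if name_fragment in command and cpu_percent >= cpu_threshold:
            busy.append((pid, cpu_percent, command))
    return busy


# Hypothetical sample, for illustration only.
SAMPLE_PS_OUTPUT = """\
  PID %CPU COMMAND
    1  0.3 node
   87 99.7 chrome
   92  0.0 chrome
  101 98.9 chrome
"""

for pid, cpu_percent, command in find_busy_processes(SAMPLE_PS_OUTPUT):
    print(f"pid={pid} cpu={cpu_percent}% command={command}")
```

If this still lists processes several minutes after the last render request, the renderer has probably not cleaned up after itself.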

How to reproduce it (as minimally and precisely as possible):

Submit a request for a PDF. Enjoy the show. :)

Anything else we need to know?:

This CPU issue has been happening for a while across multiple versions of Grafana and Image Renderer. I stood up a new instance of Grafana using the Image Renderer as a plugin rather than as a Docker container, and I did not get this behavior. There may be a configuration issue, but I've stripped the service back to all defaults and it continues to happen. Any troubleshooting advice would be appreciated. I've attached the logs from the Docker service:

Explore-logs-2021-08-06 09_01_51.txt

Hardware: 3 node cluster using Docker, 6 CPUs each, 8 GB of memory

Docker compose:

  renderer:
    container_name: grafana-renderer
    environment:
      GF_METRICS_ENABLED: "true"
      GF_REPORTING_IMAGE_SCALE_FACTOR: 2
      GF_REPORTING_RENDERING_TIMEOUT: 600s
      GF_LOG_FILTERS: rendering:debug
    image: grafana/grafana-image-renderer
    logging:
      driver: loki
      options:
        loki-url: http://log01.tylephony.com:3100/loki/api/v1/push
    ports:
    - 8081:8081/tcp
    restart: always
    volumes:
    - /mnt/docker/grafana-enterprise/config.json:/usr/src/app/config.json:ro

Using this for Grafana:

      GF_RENDERING_CALLBACK_URL: http://grafana-enterprise:3000/
      GF_RENDERING_SERVER_URL: http://grafana-renderer:8081/render

This keeps requests for the image renderer local to the Docker network instead of spreading them across the load balancer.

config.json (slightly updated from the default provided in the Docker container: https://github.com/grafana/grafana-image-renderer/blob/master/devenv/docker/custom-config/config.json)

{
  "service": {
    "host": null,
    "port": 8081,

    "metrics": {
      "enabled": true,
      "collectDefaultMetrics": true,
      "requestDurationBuckets": [1, 5, 7, 9, 11, 13, 15, 20, 30]
    },

    "logging": {
      "level": "debug",
      "console": {
        "json": true,
        "colorize": false
      }
    }
  },
  "rendering": {
    "chromeBin": null,
    "args": [
      "--no-sandbox",
      "--force-color-profile=generic-rgb"
    ],
    "ignoresHttpsErrors": false,

    "timezone": null,
    "acceptLanguage": null,
    "width": 1000,
    "height": 500,
    "deviceScaleFactor": 4,
    "maxWidth": 10000,
    "maxHeight": 10000,
    "maxDeviceScaleFactor": 4,

    "mode": "clustered",
    "clustering": {
      "mode": "browser",
      "maxConcurrency": 50
    },

    "verboseLogging": true,
    "dumpio": false
  }
}
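
As a rough sanity check on a config like the one above, the following Python sketch flags two things: a clustered `maxConcurrency` higher than the number of CPUs, and a `deviceScaleFactor` above `maxDeviceScaleFactor`. The CPU comparison is a rule of thumb assumed here for illustration, not a documented limit of the renderer. The function name `check_renderer_config` and the `cpu_count` parameter are hypothetical.

```python
import json


def check_renderer_config(config, cpu_count):
    """Return a list of human-readable warnings for a renderer config dict."""
    warnings = []
    rendering = config.get("rendering", {})
    clustering = rendering.get("clustering", {})

    # Assumption (heuristic only): every concurrent render drives its own
    # browser or browser context, so concurrency above the CPU count may
    # oversubscribe the host.
    max_concurrency = clustering.get("maxConcurrency")
    if rendering.get("mode") == "clustered" and max_concurrency is not None:
        if max_concurrency > cpu_count:
            warnings.append(
                f"maxConcurrency={max_concurrency} exceeds cpu_count={cpu_count}"
            )

    # The default scale factor should not exceed the configured maximum.
    scale = rendering.get("deviceScaleFactor")
    max_scale = rendering.get("maxDeviceScaleFactor")
    if scale is not None and max_scale is not None and scale > max_scale:
        warnings.append(
            f"deviceScaleFactor={scale} exceeds maxDeviceScaleFactor={max_scale}"
        )
    return warnings


# Only the relevant keys from the config.json reported in this issue.
REPORTED_CONFIG = json.loads("""
{
  "rendering": {
    "deviceScaleFactor": 4,
    "maxDeviceScaleFactor": 4,
    "mode": "clustered",
    "clustering": {"mode": "browser", "maxConcurrency": 50}
  }
}
""")

for warning in check_renderer_config(REPORTED_CONFIG, cpu_count=6):
    print(warning)
```

With the 6 CPUs per node listed under Hardware, this reports only the `maxConcurrency` value of 50. Whether that is actually related to the leftover processes is not established here.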

Environment:

  • Grafana Image Renderer version: Docker:latest (v3.0.1)
  • Grafana version: Grafana v8.1.0 (62e720c06b)
  • Installed plugin or remote renderer service: remote renderer service
  • OS Grafana Image Renderer is installed on: CentOS 7
  • User OS & Browser: MacOS Big Sur - Version 92.0.4515.131 (Official Build) (x86_64)
  • Others:

lux4rd0, Aug 06 '21 14:08

I do see some callback errors in Grafana like this:

t=2021-08-06T14:47:08+0000 lvl=warn msg="Request Origin is not authorized" logger=live origin=http://grafana-enterprise:3000 appUrl=http://grafana.tylephony.com/ allowedOrigins=

This is presumably because the request comes from a Docker instance while Grafana is load-balanced (the Enterprise key is tied to the load-balanced address, not the image renderer's Docker-named instance). However, I would expect everything to fail if this were the issue, not just a partial PDF. Nor does it specifically explain the high CPU after a failure.

lux4rd0, Aug 06 '21 14:08

When I pull this back to only the defaults of the Docker container, I don't see the CPU issues (though I'm still not getting all of the panels rendered).

The difference is default (no high CPU) versus clustered (high CPU):

    "mode": "default",
    "clustering": {
      "mode": "browser",
      "maxConcurrency": 5
    },

and

    "mode": "clustered",
    "clustering": {
      "mode": "browser",
      "maxConcurrency": 6
    },

lux4rd0, Aug 06 '21 15:08

@lux4rd0, try this:

    "mode": "clustered",
    "clustering": {
      "mode": "context",
      "maxConcurrency": 6
    },

"mode": "context" is working for me.
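
For anyone applying this change by script rather than by hand, here is a minimal Python sketch that switches the clustering mode in a loaded config dict. The helper name `with_clustering_mode` is hypothetical, and it assumes "browser" and "context" are the only valid clustering modes.

```python
import copy
import json

# Assumption: these are the only clustering modes the renderer accepts.
VALID_CLUSTERING_MODES = {"browser", "context"}


def with_clustering_mode(config, clustering_mode):
    """Return a copy of config with rendering.clustering.mode replaced."""
    if clustering_mode not in VALID_CLUSTERING_MODES:
        raise ValueError(f"unknown clustering mode: {clustering_mode!r}")
    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    rendering = updated.setdefault("rendering", {})
    clustering = rendering.setdefault("clustering", {})
    clustering["mode"] = clustering_mode
    return updated


# The clustered settings quoted earlier in this thread.
current = {
    "rendering": {
        "mode": "clustered",
        "clustering": {"mode": "browser", "maxConcurrency": 6},
    }
}

patched = with_clustering_mode(current, "context")
print(json.dumps(patched["rendering"], indent=2))
```

The function returns a copy rather than editing in place, so the original settings stay available for comparison if the change does not help.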

tuhnu9089, Aug 13 '21 14:08

Thanks, @tuhnu9089 - I'll give it a shot.

lux4rd0, Aug 13 '21 21:08