grafana-image-renderer
High CPU at idle after failed image render (Docker instance with mode: clustered)
What happened:
Sending a request from Grafana to the local Docker container to generate a PDF results in an incomplete PDF with missing panels, and leaves processes in the container running at 100% CPU utilization. Resubmitting the request produces a different set of missing panels each time. Rendering the dashboards in Grafana itself is instantaneous, so this does not look like a data source or Grafana issue; it appears to be specific to the renderer.
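To pin down which leftover processes are the ones burning CPU, a small filter over a process listing can help. This is only a sketch: it assumes the text has three whitespace-separated columns (PID, %CPU, command) with a header row, as produced by something like `ps -eo pid,pcpu,comm` on the Docker host. The function name `busy_processes`, the 50% threshold, and the sample listing are all made up for illustration and are not taken from the attached logs.

```python
def busy_processes(ps_output, threshold=50.0):
    """Return (pid, cpu_percent, command) for rows at or above threshold."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # ignore malformed rows
        pid, cpu, command = parts
        try:
            pid_value = int(pid)
            cpu_value = float(cpu)
        except ValueError:
            continue  # ignore rows whose PID or %CPU is not numeric
        if cpu_value >= threshold:
            rows.append((pid_value, cpu_value, command.strip()))
    return rows


# Hypothetical listing, for illustration only.
sample = """\
  PID %CPU COMMAND
    1  0.3 node
  212 99.8 chrome
  240  0.0 chrome
  251 100  chrome
"""
print(busy_processes(sample))
```

Running this a few minutes after a failed PDF request, while the renderer should be idle, would show whether any browser processes are still pegged.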
What you expected to happen:
A complete PDF, with all threads and processes finishing and no CPU usage when idle.
How to reproduce it (as minimally and precisely as possible):
Submit a request for a PDF. Enjoy the show. :)
Anything else we need to know?:
This CPU issue has been happening for a while across multiple versions of Grafana and Image Renderer. I stood up a new instance of Grafana using the Image Renderer as a plugin rather than as a Docker container, and I did not see this behavior. There may be a configuration issue, but I've stripped the service back to all defaults and the problem persists. Any troubleshooting advice would be appreciated. I've attached the logs from the Docker service:
Explore-logs-2021-08-06 09_01_51.txt
Hardware: 3 node cluster using Docker, 6 CPUs each, 8 GB of memory
Docker compose:
renderer:
  container_name: grafana-renderer
  environment:
    GF_METRICS_ENABLED: "true"
    GF_REPORTING_IMAGE_SCALE_FACTOR: 2
    GF_REPORTING_RENDERING_TIMEOUT: 600s
    GF_LOG_FILTERS: rendering:debug
  image: grafana/grafana-image-renderer
  logging:
    driver: loki
    options:
      loki-url: http://log01.tylephony.com:3100/loki/api/v1/push
  ports:
    - 8081:8081/tcp
  restart: always
  volumes:
    - /mnt/docker/grafana-enterprise/config.json:/usr/src/app/config.json:ro
Using this for Grafana:
GF_RENDERING_CALLBACK_URL: http://grafana-enterprise:3000/
GF_RENDERING_SERVER_URL: http://grafana-renderer:8081/render
This keeps requests for the image renderer local to the Docker network instead of spreading them across the load balancer.
config.json (slightly modified from the default provided in the Docker container: https://github.com/grafana/grafana-image-renderer/blob/master/devenv/docker/custom-config/config.json )
{
  "service": {
    "host": null,
    "port": 8081,
    "metrics": {
      "enabled": true,
      "collectDefaultMetrics": true,
      "requestDurationBuckets": [1, 5, 7, 9, 11, 13, 15, 20, 30]
    },
    "logging": {
      "level": "debug",
      "console": {
        "json": true,
        "colorize": false
      }
    }
  },
  "rendering": {
    "chromeBin": null,
    "args": [
      "--no-sandbox",
      "--force-color-profile=generic-rgb"
    ],
    "ignoresHttpsErrors": false,
    "timezone": null,
    "acceptLanguage": null,
    "width": 1000,
    "height": 500,
    "deviceScaleFactor": 4,
    "maxWidth": 10000,
    "maxHeight": 10000,
    "maxDeviceScaleFactor": 4,
    "mode": "clustered",
    "clustering": {
      "mode": "browser",
      "maxConcurrency": 50
    },
    "verboseLogging": true,
    "dumpio": false
  }
}
Environment:
- Grafana Image Renderer version: Docker:latest (v3.0.1)
- Grafana version: Grafana v8.1.0 (62e720c06b)
- Installed plugin or remote renderer service: remote renderer service
- OS Grafana Image Renderer is installed on: CentOS 7
- User OS & Browser: MacOS Big Sur - Version 92.0.4515.131 (Official Build) (x86_64)
- Others:
I do see some callback errors in Grafana like this:
t=2021-08-06T14:47:08+0000 lvl=warn msg="Request Origin is not authorized" logger=live origin=http://grafana-enterprise:3000 appUrl=http://grafana.tylephony.com/ allowedOrigins=
This is likely because the request comes from a Docker instance while Grafana is load-balanced (the Enterprise key is tied to the load-balanced address, not the image renderer's Docker instance name). However, if this were the problem, I would expect everything to fail rather than produce a partial PDF, and it doesn't seem specifically related to the high CPU after a failure.
When I revert to the Docker container's defaults, I no longer see the CPU issue (though still not all of the panels are rendered).
The difference is default (no high CPU) versus clustered (high CPU):
  "mode": "default",
  "clustering": {
    "mode": "browser",
    "maxConcurrency": 5
  },
and
  "mode": "clustered",
  "clustering": {
    "mode": "browser",
    "maxConcurrency": 6
  },
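When comparing two renderer configurations like the fragments above, it can help to list exactly which settings differ. The following is a minimal sketch, assuming both configs are already loaded as Python dictionaries; `flatten` and `diff_settings` are hypothetical helper names, not part of any Grafana tooling.

```python
def flatten(config, prefix=""):
    """Flatten nested dicts into dotted keys, e.g. 'clustering.mode'."""
    flat = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_settings(a, b):
    """Return {dotted_key: (value_in_a, value_in_b)} for differing settings."""
    flat_a, flat_b = flatten(a), flatten(b)
    keys = sorted(set(flat_a) | set(flat_b))
    return {k: (flat_a.get(k), flat_b.get(k))
            for k in keys if flat_a.get(k) != flat_b.get(k)}


# The two "rendering" fragments quoted in this comment.
default_cfg = {"mode": "default",
               "clustering": {"mode": "browser", "maxConcurrency": 5}}
clustered_cfg = {"mode": "clustered",
                 "clustering": {"mode": "browser", "maxConcurrency": 6}}

print(diff_settings(default_cfg, clustered_cfg))
```

For these two fragments the only differences are the top-level `mode` and `clustering.maxConcurrency`; `clustering.mode` is `browser` in both.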
@lux4rd0, try this:
  "mode": "clustered",
  "clustering": {
    "mode": "context",
    "maxConcurrency": 6
  },
"mode": "context" is working for me.
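If the config is managed by a script, the suggested change could be applied along these lines. This is a sketch under assumptions: the config has the same `rendering.clustering` layout as the config.json posted above, and `use_context_clustering` is a hypothetical helper name. It only edits the dictionary and does not touch the running container.

```python
import copy
import json


def use_context_clustering(config, max_concurrency=6):
    """Return a copy of config with clustered rendering in 'context' mode."""
    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    rendering = updated.setdefault("rendering", {})
    rendering["mode"] = "clustered"
    clustering = rendering.setdefault("clustering", {})
    clustering["mode"] = "context"
    clustering["maxConcurrency"] = max_concurrency
    return updated


# Trimmed-down version of the config.json from the original report.
original = {
    "rendering": {
        "mode": "clustered",
        "clustering": {"mode": "browser", "maxConcurrency": 50},
    }
}

print(json.dumps(use_context_clustering(original), indent=2))
```

To use it on a real file, load the file with `json.load`, pass the result through the function, and write it back with `json.dump`. The renderer would presumably need a restart to pick up the new file.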
Thanks, @tuhnu9089 - I'll give it a shot.