[Bug]: High CPU Usage

Open chihebnabil opened this issue 10 months ago • 10 comments

I'm experiencing consistently high CPU usage (100%) on my Coolify instance, and after investigation, I've determined that the coolify-proxy container is the primary culprit.

  • Consistently high CPU usage on the server.
  • docker stats shows the coolify-proxy container consuming a very large percentage of CPU.
  • The coolify-proxy logs reveal numerous errors related to Let's Encrypt certificate renewal failures.
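
For anyone trying to confirm the same symptom, here is a minimal Python sketch that ranks containers by CPU share. It assumes the input text comes from something like `docker stats --no-stream --format '{{.Name}}\t{{.CPUPerc}}'`. The function names, the 50% threshold, and the sample values are illustrative and not taken from the reporter's server.

```python
def parse_cpu_stats(text):
    """Parse 'name<TAB>cpu%' lines into a {container_name: percent} dict."""
    usage = {}
    for line in text.splitlines():
        name, _, cpu = line.strip().partition("\t")
        try:
            usage[name.strip()] = float(cpu.strip().rstrip("%"))
        except ValueError:
            continue  # skip blank lines, headers, or anything unparsable
    return usage


def busiest(usage, threshold=50.0):
    """Return (name, percent) pairs at or above threshold, highest first."""
    hot = [(name, cpu) for name, cpu in usage.items() if cpu >= threshold]
    return sorted(hot, key=lambda item: -item[1])


# Hypothetical sample; real values come from the docker stats call above.
sample = "coolify-proxy\t97.31%\ncoolify\t1.20%\ncoolify-db\t0.45%\n"
print(busiest(parse_cpu_stats(sample)))  # [('coolify-proxy', 97.31)]
```

Note that Docker reports CPU per core, so a container can exceed 100% on a multi-core host.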

Root Cause:

The logs indicate that the coolify-proxy (which uses Traefik in my case) is repeatedly attempting to renew SSL certificates for domain names that are no longer in use.

Expected Behavior:

Coolify should automatically remove or stop attempting to renew certificates for domains that are no longer associated with any resources in the Coolify UI.
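
One way to see which certificates might be stale is to compare the domains stored in Traefik's ACME file against the domains still in use. The Python sketch below assumes the commonly seen acme.json layout (a top-level key per resolver, with a "Certificates" list whose entries carry "domain.main" and optional "domain.sans"). The function name, the sample data, and the list of active domains are hypothetical, and this only lists candidates; it does not change anything.

```python
import json


def stale_acme_domains(acme_data, active_domains, resolver="letsencrypt"):
    """Return stored certificate domains that are not in active_domains."""
    # "Certificates" may be missing or null when nothing has been issued yet.
    certs = (acme_data.get(resolver) or {}).get("Certificates") or []
    active = {d.lower() for d in active_domains}
    stale = set()
    for cert in certs:
        domain = cert.get("domain") or {}
        names = [domain.get("main"), *(domain.get("sans") or [])]
        stale.update(n.lower() for n in names if n and n.lower() not in active)
    return sorted(stale)


# Inline sample shaped like the assumed layout (not real data).
sample = json.loads("""
{"letsencrypt": {"Certificates": [
  {"domain": {"main": "app.example.com"}},
  {"domain": {"main": "old.example.com", "sans": ["www.old.example.com"]}}
]}}
""")
print(stale_acme_domains(sample, ["app.example.com"]))
# ['old.example.com', 'www.old.example.com']
```

With the proxy config posted later in this thread, the file would sit at /data/coolify/proxy/acme.json on the host and the resolver is named letsencrypt. Back the file up before editing it by hand.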

I hope this information is helpful. Please let me know if you need any further details.

Steps to Reproduce

/

Example Repository URL

No response

Coolify Version

v4.0.0-beta.397

Are you using Coolify Cloud?

No (self-hosted)

Operating System and Version (self-hosted)

24.04

Additional Information

No response

chihebnabil avatar Mar 15 '25 22:03 chihebnabil

What Traefik version are you running? This is currently a known problem with Traefik v3.1.X. According to this issue on the Traefik GitHub repo, downgrading Traefik to v3.0.X helped some people.

If that still doesn't work, you probably want to follow the issue I linked above, as this seems to be a problem with Traefik itself. Switching your proxy to Caddy could of course also be a solution.

Cinzya avatar Mar 16 '25 22:03 Cinzya

I'm using the default Coolify Traefik config, which currently uses v3.1.

If this is a known issue, shouldn't the default be set to a stable version like v3.0.X instead? Would downgrading be the best workaround for now?

name: coolify-proxy
networks:
  coolify:
    external: true
services:
  traefik:
    container_name: coolify-proxy
    image: 'traefik:v3.1'
    restart: unless-stopped
    extra_hosts:
      - 'host.docker.internal:host-gateway'
    networks:
      - coolify
    ports:
      - '80:80'
      - '443:443'
      - '443:443/udp'
      - '8080:8080'
    healthcheck:
      test: 'wget -qO- http://localhost:80/ping || exit 1'
      interval: 4s
      timeout: 2s
      retries: 5
    volumes:
      - '/var/run/docker.sock:/var/run/docker.sock:ro'
      - '/data/coolify/proxy:/traefik'
    command:
      - '--ping=true'
      - '--ping.entrypoint=http'
      - '--api.dashboard=true'
      - '--entrypoints.http.address=:80'
      - '--entrypoints.https.address=:443'
      - '--entrypoints.http.http.encodequerysemicolons=true'
      - '--entryPoints.http.http2.maxConcurrentStreams=250'
      - '--entrypoints.https.http.encodequerysemicolons=true'
      - '--entryPoints.https.http2.maxConcurrentStreams=250'
      - '--entrypoints.https.http3'
      - '--providers.file.directory=/traefik/dynamic/'
      - '--providers.file.watch=true'
      - '--certificatesresolvers.letsencrypt.acme.httpchallenge=true'
      - '--certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=http'
      - '--certificatesresolvers.letsencrypt.acme.storage=/traefik/acme.json'
      - '--api.insecure=false'
      - '--providers.docker=true'
      - '--providers.docker.exposedbydefault=false'
    labels:
      - traefik.enable=true
      - traefik.http.routers.traefik.entrypoints=http
      - traefik.http.routers.traefik.service=api@internal
      - traefik.http.services.traefik.loadbalancer.server.port=8080
      - coolify.managed=true
      - coolify.proxy=true

FYI, I ended up limiting the CPU usage for the Traefik container:

    deploy:
      resources:
        limits:
          cpus: '1.0'
          memory: 512M

chihebnabil avatar Mar 17 '25 13:03 chihebnabil

We are still trying to figure out the best workaround for this bug, as the Traefik developers don't seem to be sure about it either. It was previously recommended to update to v3.2 instead, but that didn't work for another Coolify user, and the latest post about downgrading to v3.0 isn't that old, as you can tell.

So if you can confirm that downgrading to v3.0 worked for you, we can consider locking new Coolify installations to v3.0.

Misconfigurations of Traefik can apparently also cause high CPU usage, so that's also something we need to potentially look at.

Cinzya avatar Mar 17 '25 19:03 Cinzya

Thanks for replying @Cinzya

I tried switching between multiple versions, but the issue persisted across all of them. The only workaround that somewhat helped was limiting the resources, but even then, I still noticed some weird behavior when applying resource limits.

chihebnabil avatar Mar 18 '25 14:03 chihebnabil

Gotcha, I would recommend you leave a comment on the Issue at the Traefik GitHub then. The most effective way will be to get this fixed by the Traefik Developers themselves and it looks like they are still in need of more information from affected people.

But I'll bring the Coolify Devs attention to this as well. Maybe changing the configs could help in some way too.

You also need to keep in mind that high traffic will also cause high CPU usage on Traefik. But I've heard from people that they still have this issue with barely any visitors.

Cinzya avatar Mar 18 '25 15:03 Cinzya

@chihebnabil I don’t believe Traefik (coolify-proxy) itself is the main culprit. It’s possible the high CPU usage is related to Sentinel. Could you try disabling Sentinel temporarily to see if that resolves the issue? Let us know how it goes and if the CPU usage drops once Sentinel is disabled.

remiilekun avatar Mar 20 '25 00:03 remiilekun

@remiilekun Well, Sentinel was disabled. I enabled it just to check the CPU/memory usage for my apps to make sure it's not caused by that. Basically, there was no difference between enabling and disabling Sentinel.

chihebnabil avatar Mar 20 '25 00:03 chihebnabil

@chihebnabil got it, thanks for confirming

remiilekun avatar Mar 20 '25 00:03 remiilekun

I tried to deploy Dify, and the CPU usage hit 100% as soon as I filled in the Git address. I guess there is an issue with large env var lists.

v4.0.0-beta.397

A-limon avatar Mar 24 '25 02:03 A-limon

@chihebnabil I ran into a similar issue with Traefik in production. Are your applications already handling some load? In my case, the issue wasn’t with Traefik itself, but rather how proxying was managed with the web apps / services. Since you're using HTTP/3 and HTTP/2, you might want to check for the same issue. Traefik was upgrading traffic to HTTP/2 while still communicating internally over HTTP/1.1, leading to high CPU usage. I resolved it by ensuring the same protocol stack was used both internally and externally.
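
The comment above does not say how the protocols were aligned, so the following is only one possible reading. With Traefik's Docker labels, the backend scheme is set through `traefik.http.services.<name>.loadbalancer.server.scheme` and, as far as I know, defaults to plain `http` when the label is absent, while `h2c` selects cleartext HTTP/2 to the backend. This Python sketch reports the effective scheme per service from a list of labels; the function name and the sample labels are hypothetical.

```python
import re

# Matches the port/scheme labels of a Traefik HTTP service definition.
_SERVER_LABEL = re.compile(
    r"^traefik\.http\.services\.([^.]+)\.loadbalancer\.server\.(port|scheme)=(.+)$"
)


def backend_schemes(labels):
    """Map each service name to its backend scheme, assuming 'http' if unset."""
    schemes = {}
    for label in labels:
        match = _SERVER_LABEL.match(label.strip())
        if not match:
            continue
        service, key, value = match.groups()
        schemes.setdefault(service, "http")  # assumed default when no scheme label
        if key == "scheme":
            schemes[service] = value
    return schemes


# Hypothetical labels for two applications behind the proxy.
labels = [
    "traefik.enable=true",
    "traefik.http.services.web.loadbalancer.server.port=3000",
    "traefik.http.services.api.loadbalancer.server.port=8000",
    "traefik.http.services.api.loadbalancer.server.scheme=h2c",
]
print(backend_schemes(labels))  # {'web': 'http', 'api': 'h2c'}
```

Services that still show `http` are the ones to check if a mismatch between external HTTP/2 or HTTP/3 and internal HTTP/1.1 is suspected. Whether switching them helps depends on the backend actually supporting h2c.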

appreciated avatar Mar 25 '25 19:03 appreciated