docker-selenium icon indicating copy to clipboard operation
docker-selenium copied to clipboard

[🐛 Bug]: High CPU usage with 4.31 compared to 4.30

Open parholmdahl opened this issue 7 months ago • 26 comments

What happened?

I have tried to update our grid, from 4.29 to 4.31 at two separate occasions. Both times I see an abnormal high use of CPU in the kubernetes grid. Even when the nodes are idleing.

Going from an idle CPU usage on the nodes of 0.01 cpu eqvivalents, to 100% on 1 cpu per node (they are allowed 3 cpu each if they need, soo it does not max out).

Image

Normal testrun cpu usage on the left in the image. At 07:10 i updated to 4.31 on three nodes.

Image

Here are the CPU usage in table form.. Observe three nodes with 0.99 usage, and 2 with 0.01 (the latter is normal).

I can't see anything in any log that gives any hints on why this is happening.. I tried turning of Tracing (could not see anything in the traces anyway), but that did not do any changes..

I have tried to anonymize the setup files in this bugreport..

Command used to start Selenium Grid with Docker (or Kubernetes)

Example deployment file for one node, using both chrome and edge, and adding a mtlscert: 

apiVersion: apps/v1
kind: Deployment
metadata:
  name: selenium-magicnode-XXXXYYYY
  labels:
    app: selenium-magicnode-XXXXYYYY
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 0
  selector:
    matchLabels:
      app: selenium-magicnode-XXXXYYYY
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate      
  template:
    metadata:
      labels:
        app: selenium-magicnode-XXXXYYYY
    spec:


      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: 2Gi

      ## Video volume
      - name: selenium-hub-volume
        persistentVolumeClaim:
          claimName: selenium-hub-pvc
      
      ## Custom certs volume
      - name: cert-volume
        secret:
          secretName: XXXXYYYY
          items:
          - key: cert.p12
            path: cert.p12
          - key: password
            path: password                

      containers:
      - name: selenium-magicnode-XXXXYYYY
        image: localImage
        imagePullPolicy: Always
        ports:
          - containerPort: 5555
          - containerPort: 5900
          - containerPort: 4444



        ## Volume mounts
        volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          ## Video volume mount
          - mountPath: "/videos"
            name: selenium-hub-volume
          ## Custom certs volume mount
          - name: cert-volume
            mountPath: /opt/selenium/certs
            readOnly: true       


        env:
          ## Custom certs setup
          - name: SE_ENABLE_CUSTOM_CERTS
            value: "true"
          - name: CERT_TRUST_ATTR
            value: "TCu,Cu,Tu"       


          ## Event bus setup
          - name: SE_EVENT_BUS_HOST
            value: "URL"
          - name: SE_EVENT_BUS_SUBSCRIBE_PORT
            value: "4443"
          - name: SE_EVENT_BUS_PUBLISH_PORT
            value: "4442"
          - name: SE_NODE_GRID_URL
            value: "https://URL"


          ## Stereotype setup
          - name: SE_NODE_STEREOTYPE
            value: "[{\"browserName\": \"chrome\"}, {\"browserName\": \"MicrosoftEdge\"}]"
          - name: SE_NODE_STEREOTYPE_EXTRA
            value: "{\"platformName\": \"Linux\", \"character:name\": \"XXXXYYYY\", \"networkname:applicationName\":\"XXXXYYYY\", \"nodename:applicationName\":\"XXXXYYYY\"}"
          - name: SE_NODE_NAME
            value: "XXXXYYYY"
          - name: SE_OTEL_SERVICE_NAME
            value: "XXXXYYYY"       
          - name: SE_NODE_MAX_SESSIONS
            value: "1"
          - name: SE_NODE_OVERRIDE_MAX_SESSIONS
            value: "false"


          ## General setup
          - name: GENERATE_CONFIG
            value: "true"
          - name: TZ
            value: "Europe/Stockholm"
          - name: SE_ENABLE_MANAGED_DOWNLOADS
            value: "true"
          - name: SE_VNC_NO_PASSWORD
            value: "1"


          ## Tracing setup
          - name: SE_ENABLE_TRACING
            value: "false"
          - name: SE_OTEL_TRACES_EXPORTER
            value: "otlp"
          - name: SE_OTEL_EXPORTER_ENDPOINT
            value: "http://selenium-hub-jaeger-svc:4317"


          ## Video setup
          - name: SE_RECORD_VIDEO
            value: "true"
          - name: SE_VIDEO_FILE_NAME
            value: "auto"


        ## Security context
        securityContext:
          runAsNonRoot: true
          runAsUser: 1200
          allowPrivilegeEscalation: false
          capabilities:
            drop: 
              - ALL  
          #seccompProfile: 
          #  type: RuntimeDefault


        ## Probes
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /status
            port: 5555
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        startupProbe:
          failureThreshold: 3
          httpGet:
            path: /status
            port: 5555
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /status
            port: 5555
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1

        ## Resources (1 cpu per browser at least, and 1 gig of memory per browser)
        resources:
          limits:
            memory: "4Gi"
            cpu: "3"
          requests:
            memory: "100Mi"
            cpu: "0.1"


        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: docker-credentials
      restartPolicy: Always
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30

Relevant log output

Installing certificates...
2025-04-15T07:08:22.970316001Z Found certificate file
2025-04-15T07:08:22.980665242Z Installing certificate in: /home/seluser/.pki/nssdb
pk12util: no nickname for cert in PKCS12 file.
2025-04-15T07:08:23.015961136Z pk12util: using nickname: XXXXYYYY
2025-04-15T07:08:23.015963188Z pk12util: PKCS12 IMPORT SUCCESSFUL
2025-04-15T07:08:23.021365918Z Virtual environment detected at /opt/venv, activating...
2025-04-15T07:08:23.025128350Z Python 3.12.3
2025-04-15T07:08:23.153845714Z 2025-04-15 09:08:23,153 INFO Included extra file "/etc/supervisor/conf.d/chrome-cleanup.conf" during parsing
2025-04-15T07:08:23.153896896Z 2025-04-15 09:08:23,153 INFO Included extra file "/etc/supervisor/conf.d/edge-cleanup.conf" during parsing
2025-04-15T07:08:23.153899548Z 2025-04-15 09:08:23,153 INFO Included extra file "/etc/supervisor/conf.d/firefox-cleanup.conf" during parsing
2025-04-15T07:08:23.153901289Z 2025-04-15 09:08:23,153 INFO Included extra file "/etc/supervisor/conf.d/recorder.conf" during parsing
2025-04-15 09:08:23,153 INFO Included extra file "/etc/supervisor/conf.d/selenium.conf" during parsing
2025-04-15T07:08:23.153933914Z 2025-04-15 09:08:23,153 INFO Included extra file "/etc/supervisor/conf.d/uploader.conf" during parsing
2025-04-15 09:08:23,156 INFO RPC interface 'supervisor' initialized
2025-04-15T07:08:23.156974433Z 2025-04-15 09:08:23,156 INFO supervisord started with pid 15
2025-04-15T07:08:24.159579444Z 2025-04-15 09:08:24,159 INFO spawned: 'video-ready' with pid 16
2025-04-15T07:08:24.160791824Z 2025-04-15 09:08:24,160 INFO spawned: 'xvfb' with pid 17
2025-04-15T07:08:24.161872245Z 2025-04-15 09:08:24,161 INFO spawned: 'vnc' with pid 18
2025-04-15 09:08:24,162 INFO spawned: 'novnc' with pid 19
2025-04-15T07:08:24.164137266Z 2025-04-15 09:08:24,163 INFO spawned: 'video-recording' with pid 20
2025-04-15T07:08:24.165300122Z 2025-04-15 09:08:24,165 INFO spawned: 'selenium-node' with pid 21
2025-04-15T07:08:24.173657199Z 2025-04-15 09:08:24,173 INFO success: video-ready entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2025-04-15T07:08:24.173701086Z 2025-04-15 09:08:24,173 INFO success: video-recording entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2025-04-15T07:08:24.173705054Z 2025-04-15 09:08:24,173 INFO success: selenium-node entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2025-04-15T07:08:24.184559641Z 2025-04-15 07:08:24,183 [endpoint.checks] - Endpoint http://localhost:5555/status is not available - status code: 000
2025-04-15T07:08:24.186096829Z 2025-04-15 07:08:24,185 [video.recorder] - Video folder exists: /videos
2025-04-15 07:08:24,188 [video.recorder] - Waiting for the display localhost:99.0 is open
2025-04-15T07:08:24.635531632Z Appending Selenium option: --register-period 120
2025-04-15T07:08:24.635557410Z Appending Selenium option: --register-cycle 10
2025-04-15T07:08:24.635562310Z Appending Selenium option: --register-shutdown-on-failure 
2025-04-15T07:08:24.635565577Z Appending Selenium option: --heartbeat-period 30
2025-04-15T07:08:24.635608173Z Appending Selenium option: --log-level INFO
2025-04-15T07:08:24.635638529Z Appending Selenium option: --http-logs false
2025-04-15T07:08:24.635709830Z Appending Selenium option: --structured-logs false
2025-04-15T07:08:24.635806723Z Generating Selenium Config
2025-04-15T07:08:24.637985153Z Configuring server...
2025-04-15T07:08:24.637995901Z Setting up SE_NODE_HOST...
Setting up SE_NODE_PORT...
2025-04-15T07:08:24.719084240Z Appending Selenium option: --tracing false
2025-04-15T07:08:24.719124024Z Tracing is disabled
2025-04-15T07:08:24.719135278Z Appending Selenium option: --bind-host false
2025-04-15T07:08:24.719181178Z Appending Selenium option: --config /opt/selenium/config.toml
2025-04-15T07:08:24.719195717Z Selenium Grid Node configuration: 
2025-04-15T07:08:24.719952064Z [events]
2025-04-15T07:08:24.719961462Z     publish = "tcp://URL:4442"
    subscribe = "tcp://URL:4443"
2025-04-15T07:08:24.719965953Z     
2025-04-15T07:08:24.719967871Z [node]
2025-04-15T07:08:24.719970125Z grid-url = "https://URL"
2025-04-15T07:08:24.719972171Z session-timeout = 300
2025-04-15T07:08:24.719974326Z override-max-sessions = false
2025-04-15T07:08:24.719976111Z detect-drivers = false
2025-04-15T07:08:24.719978002Z drain-after-session-count = 0
max-sessions = 1
2025-04-15T07:08:24.719981495Z 
2025-04-15T07:08:24.719983397Z [[node.driver-configuration]]
2025-04-15T07:08:24.719985216Z display-name = "chrome"
2025-04-15T07:08:24.719988649Z stereotype = '{"browserName": "chrome", "platformName": "Linux", "character:name": "XXXXYYYY", "networkname:applicationName": "XXXXYYYY", "nodename:applicationName": "XXXXYYYY"}'
max-sessions = 1
2025-04-15T07:08:24.720012290Z 
2025-04-15T07:08:24.720014083Z [[node.driver-configuration]]
2025-04-15T07:08:24.720016377Z display-name = "MicrosoftEdge"
2025-04-15T07:08:24.720019395Z stereotype = '{"browserName": "MicrosoftEdge", "platformName": "Linux", "character:name": "XXXXYYYY", "networkname:applicationName": "XXXXYYYY", "nodename:applicationName": "XXXXYYYY"}'
2025-04-15T07:08:24.720021222Z max-sessions = 1
2025-04-15T07:08:24.720022874Z 
2025-04-15T07:08:24.720197219Z Starting Selenium Grid Node...
2025-04-15T07:08:24.720326612Z Using JAVA_OPTS:  -Dwebdriver.remote.enableTracing=false -Dwebdriver.httpclient.version=HTTP_1_1
2025-04-15T07:08:25.197832226Z 2025-04-15 07:08:25,197 [video.recorder] - Display localhost:99.0 is open with dimensions 1920x1080
2025-04-15 09:08:25,197 INFO success: xvfb entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2025-04-15T07:08:25.198206730Z 2025-04-15 09:08:25,198 INFO success: vnc entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2025-04-15T07:08:25.198247735Z 2025-04-15 09:08:25,198 INFO success: novnc entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2025-04-15T07:08:25.199167220Z 2025-04-15 07:08:25,198 [video.recorder] - Waiting for Node endpoint responds
2025-04-15T07:08:25.200223115Z 09:08:25.195 INFO [LoggingOptions.configureLogEncoding] - Using the system default encoding
2025-04-15T07:08:25.204500538Z 09:08:25.204 INFO [LoggingOptions.getTracer] - Using null tracer
2025-04-15T07:08:25.263847842Z 09:08:25.263 INFO [UnboundZmqEventBus.<init>] - Connecting to tcp://url:4442 and tcp://url:4443
2025-04-15T07:08:25.344603556Z 09:08:25.344 INFO [UnboundZmqEventBus.<init>] - Sockets created
09:08:26.346 INFO [UnboundZmqEventBus.<init>] - Event bus ready
09:08:26.492 INFO [NodeServer.createHandlers] - Reporting self as: http://10.57.2.117:5555
09:08:26.495 INFO [LoggingOptions.getTracer] - Using null tracer
09:08:26.508 INFO [NodeOptions.getSessionFactories] - Detected 3 available processors
09:08:26.547 INFO [NodeOptions.report] - Adding MicrosoftEdge for {"browserName": "MicrosoftEdge","character:name": "XXXXYYYY","networkname:applicationName": "XXXXYYYY","nodename:applicationName": "XXXXYYYY","platformName": "linux","se:noVncPort": 7900,"se:vncEnabled": true} 1 times
09:08:26.548 INFO [NodeOptions.report] - Adding chrome for {"browserName": "chrome","character:name": "XXXXYYYY","networkname:applicationName": "XXXXYYYY","nodename:applicationName": "XXXXYYYY","platformName": "linux","se:noVncPort": 7900,"se:vncEnabled": true} 1 times
09:08:26.580 INFO [Node.<init>] - Binding additional locator mechanisms: relative
09:08:26.714 INFO [NodeServer$2.start] - Starting registration process for Node http://10.57.2.117:5555
09:08:26.715 INFO [NodeServer.execute] - Started Selenium node 4.31.0 (revision 4ae8fc9f8a): http://10.57.2.117:5555
09:08:26.736 INFO [NodeServer$2.lambda$start$2] - Sending registration event...
09:08:26.825 INFO [NodeServer.lambda$createHandlers$2] - Node has been added
2025-04-15 07:08:27,218 [video.recorder] - Node endpoint is responding now. Proceeding next steps...

Operating System

Kubernetes / Ranger / Argo

Docker Selenium version (image tag)

4.31

Selenium Grid chart version (chart version)

parholmdahl avatar Apr 15 '25 07:04 parholmdahl

@parholmdahl, thank you for creating this issue. We will troubleshoot it as soon as we can.


Info for maintainers

Triage this issue by using labels.

If information is missing, add a helpful comment and then I-issue-template label.

If the issue is a question, add the I-question label.

If the issue is valid but there is no time to troubleshoot it, consider adding the help wanted label.

If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C), add the applicable G-* label, and it will provide the correct link and auto-close the issue.

After troubleshooting the issue, please add the R-awaiting answer label.

Thank you!

github-actions[bot] avatar Apr 15 '25 07:04 github-actions[bot]

In Slack channel we also get same feedback. In this case, can you try to execute into container, and execute command top to see which process consume much resources?

VietND96 avatar Apr 15 '25 10:04 VietND96

Yes!

This is from one that is acting normal (using 4.30):

Image

And this is from one that is high CPU usage (using 4.31):

Image

And what does this tell us? That it is not docker related i guess?

parholmdahl avatar Apr 15 '25 11:04 parholmdahl

To classify the issue from Grid core or browser/driver version. You can setup few Node with image tag using Grid 4.31.0 with browser/driver Chrome/Edge v134. You can checkout image tags available https://github.com/SeleniumHQ/docker-selenium/blob/trunk/CHANGELOG/4.31.0/edge_134.md, https://github.com/SeleniumHQ/docker-selenium/blob/trunk/CHANGELOG/4.31.0/chrome_134.md

VietND96 avatar Apr 17 '25 06:04 VietND96

Sorry, I have been in the middle of a release, so I have not had the time to test.. But today i tested with the nightly build (guessing this would be eqvivalent with the non built 4.32 release?)

This is what I see in "top" inside the container:

Image

So the problem i still present.

I will try to build with another Chrome version..

parholmdahl avatar May 05 '25 07:05 parholmdahl

I tried deploying 4.32 with "our old way" (that is first build an image that contains chrome, edge and firefox, and then add mTLS certificate to that images in another build).. Same issue with around 10-12% cpu on that java process..

I then thought to skip the last buildstep, so no mTLS adding.. Same issue..

As a last resort i did a "vanilla take" with the following deploycode:

`apiVersion: apps/v1 kind: Deployment metadata: name: selenium-vanillanode-invanare labels: app: selenium-vanillanode-invanare spec: progressDeadlineSeconds: 600 replicas: 1 revisionHistoryLimit: 0 selector: matchLabels: app: selenium-vanillanode-invanare strategy: rollingUpdate: maxSurge: 25% maxUnavailable: 25% type: RollingUpdate
template: metadata: labels: app: selenium-vanillanode-invanare spec:

  volumes:
  - name: dshm
    emptyDir:
      medium: Memory
      sizeLimit: 2Gi

  ## Video volume
  - name: selenium-hub-volume
    persistentVolumeClaim:
      claimName: selenium-hub-pvc
      

  containers:
  - name: selenium-vanillanode-invanare
    image: selenium/node-chrome:134.0.6998.165-chromedriver-134.0.6998.165-grid-4.31.0-20250414
    imagePullPolicy: Always
    ports:
      - containerPort: 5555
      - containerPort: 5900
      - containerPort: 4444



    ## Volume mounts
    volumeMounts:
      - mountPath: /dev/shm
        name: dshm
      ## Video volume mount
      - mountPath: "/videos"
        name: selenium-hub-volume


    env:
      ## Event bus setup
      - name: SE_EVENT_BUS_HOST
        value: "<OUR INTERNAL HOST FOR HUB>"
      - name: SE_EVENT_BUS_SUBSCRIBE_PORT
        value: "4443"
      - name: SE_EVENT_BUS_PUBLISH_PORT
        value: "4442"
      - name: SE_NODE_GRID_URL
        value: "<OUR INTERNAL HOST FOR HUB>"


      ## Stereotype setup
      - name: SE_NODE_STEREOTYPE
        value: "{\"browserName\": \"chrome\"}"
      - name: SE_NODE_STEREOTYPE_EXTRA
        value: "{\"platformName\": \"Linux\", \"character:name\": \"invanare\", \"networkname:applicationName\":\"invanare\", \"nodename:applicationName\":\"invanare\"}"
      - name: SE_NODE_NAME
        value: "vanillanode-invanare"
      - name: SE_OTEL_SERVICE_NAME
        value: "vanillanode-invanare"       
      - name: SE_NODE_MAX_SESSIONS
        value: "1"
      - name: SE_NODE_OVERRIDE_MAX_SESSIONS
        value: "false"


      ## General setup
      - name: GENERATE_CONFIG
        value: "true"
      - name: TZ
        value: "Europe/Stockholm"
      - name: SE_ENABLE_MANAGED_DOWNLOADS
        value: "true"
      - name: SE_VNC_NO_PASSWORD
        value: "1"


      ## Tracing setup
      - name: SE_ENABLE_TRACING
        value: "false"
      - name: SE_OTEL_TRACES_EXPORTER
        value: "otlp"
      - name: SE_OTEL_EXPORTER_ENDPOINT
        value: "<OUR INTERNAL HOST FOR JAEGER>"


      ## Video setup
      - name: SE_RECORD_VIDEO
        value: "true"
      - name: SE_VIDEO_FILE_NAME
        value: "auto"


    ## Security context
    securityContext:
      runAsNonRoot: true
      runAsUser: 1200
      allowPrivilegeEscalation: false
      capabilities:
        drop: 
          - ALL  
      #seccompProfile: 
      #  type: RuntimeDefault


    ## Probes
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /status
        port: 5555
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    startupProbe:
      failureThreshold: 3
      httpGet:
        path: /status
        port: 5555
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /status
        port: 5555
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1

    ## Resources (1 cpu per browser at least, and 1 gig of memory per browser)
    resources:
      limits:
        memory: "4Gi"
        cpu: "3"
      requests:
        memory: "100Mi"
        cpu: "0.1"


    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  dnsPolicy: ClusterFirst
  imagePullSecrets:
  - name: docker-credentials
  restartPolicy: Always
  schedulerName: default-scheduler
  terminationGracePeriodSeconds: 30            `

But the problem persists.. Still 10-12% cpu when using "top" command:

Image

One thing that i have observed is that the new problematic builds seem to find an environment under /opt/venv and enters that environment at the very start of the pod..

Don't know if this has anything at all to do with the problem..

Image

parholmdahl avatar May 05 '25 14:05 parholmdahl

Tested latest nightly build today, to see if the problem persists, and unfortunately it does..

parholmdahl avatar May 20 '25 08:05 parholmdahl

I also tested to upgrade the hub to 4.32 and the hub has no issues with high CPU. Then i tried to run the lates nightly build (as vanilla, no additions or updates) but still high CPU from that node..

parholmdahl avatar May 20 '25 09:05 parholmdahl

Today i tried running the latest nightly build again, with no additions (straight from hub.docker.com), but i added SE_LOG_LEVEL to environment, and set it to DEBUG.. Then i got a crashloop with this log:

Virtual environment detected at /opt/venv, activating... Python 3.12.3 2025-05-26 18:01:16,609 INFO Included extra file "/etc/supervisor/conf.d/chrome-cleanup.conf" during parsing 2025-05-26T16:01:16.609596037Z 2025-05-26 18:01:16,609 INFO Included extra file "/etc/supervisor/conf.d/recorder.conf" during parsing 2025-05-26T16:01:16.609602595Z 2025-05-26 18:01:16,609 INFO Included extra file "/etc/supervisor/conf.d/selenium.conf" during parsing 2025-05-26T16:01:16.609668239Z 2025-05-26 18:01:16,609 INFO Included extra file "/etc/supervisor/conf.d/uploader.conf" during parsing 2025-05-26T16:01:16.612496099Z 2025-05-26 18:01:16,612 INFO RPC interface 'supervisor' initialized 2025-05-26T16:01:16.612777477Z 2025-05-26 18:01:16,612 INFO supervisord started with pid 9 2025-05-26T16:01:17.615266940Z 2025-05-26 18:01:17,614 INFO spawned: 'video-ready' with pid 10 2025-05-26T16:01:17.617358486Z 2025-05-26 18:01:17,617 INFO spawned: 'xvfb' with pid 11 2025-05-26T16:01:17.618946686Z 2025-05-26 18:01:17,618 INFO spawned: 'vnc' with pid 12 2025-05-26T16:01:17.620598526Z 2025-05-26 18:01:17,620 INFO spawned: 'novnc' with pid 13 2025-05-26T16:01:17.622225852Z 2025-05-26 18:01:17,622 INFO spawned: 'video-recording' with pid 14 2025-05-26T16:01:17.624020341Z 2025-05-26 18:01:17,623 INFO spawned: 'selenium-node' with pid 15 2025-05-26T16:01:17.634112111Z 2025-05-26 18:01:17,634 INFO success: video-ready entered RUNNING state, process has stayed up for > than 0 seconds (startsecs) 2025-05-26 18:01:17,634 INFO success: video-recording entered RUNNING state, process has stayed up for > than 0 seconds (startsecs) 2025-05-26T16:01:17.634405737Z 2025-05-26 18:01:17,634 INFO success: selenium-node entered RUNNING state, process has stayed up for > than 0 seconds (startsecs) 2025-05-26T16:01:17.635227439Z 2025-05-26 16:01:17,634 [video.recorder] - Video folder exists: /videos 2025-05-26T16:01:17.638692975Z 2025-05-26 16:01:17,638 [video.recorder] - Waiting for the display localhost:99.0 is open Appending Selenium option: --register-period 120 Appending Selenium option: --register-cycle 10 Appending Selenium option: --register-shutdown-on-failure Appending Selenium option: --heartbeat-period 30 Appending Selenium option: --log-level DEBUG Appending Selenium option: --http-logs false Appending Selenium option: --structured-logs false Generating Selenium Config Configuring server... Setting up SE_NODE_HOST... Setting up SE_NODE_PORT... Merging SE_NODE_STEREOTYPE_EXTRA={"platformName": "Linux", "character:name": "invanare", "networkname:applicationName":"invanare", "nodename:applicationName":"invanare"} to main stereotype Merged stereotype: {"browserName":"chrome","platformName":"Linux","character:name":"invanare","networkname:applicationName":"invanare","nodename:applicationName":"invanare"} Appending Selenium option: --tracing false 2025-05-26T16:01:18.033574727Z Tracing is disabled Appending Selenium option: --bind-host false Appending Selenium option: --config /opt/selenium/config.toml 2025-05-26T16:01:18.034349475Z Selenium Grid Node configuration: [events] 2025-05-26T16:01:18.035372086Z publish = "tcp://selenium-hub-svc.selenium-hub.svc.cluster.local:4442" 2025-05-26T16:01:18.035399616Z subscribe = "tcp://selenium-hub-svc.selenium-hub.svc.cluster.local:4443" 2025-05-26T16:01:18.035406356Z [node] 2025-05-26T16:01:18.035417330Z grid-url = "https://selenium-hub.tnmt.nordicmedtest.se" 2025-05-26T16:01:18.035422827Z session-timeout = 300 2025-05-26T16:01:18.035428297Z override-max-sessions = false 2025-05-26T16:01:18.035433379Z detect-drivers = false 2025-05-26T16:01:18.035439058Z drain-after-session-count = 0 max-sessions = 1 2025-05-26T16:01:18.035449606Z 2025-05-26T16:01:18.035455121Z [[node.driver-configuration]] display-name = "chrome" 2025-05-26T16:01:18.035468582Z stereotype = '{"browserName":"chrome","platformName":"Linux","character:name":"invanare","networkname:applicationName":"invanare","nodename:applicationName":"invanare"}' max-sessions = 1 2025-05-26T16:01:18.035479320Z Starting Selenium Grid Node... Using JAVA_OPTS: -Dwebdriver.remote.enableTracing=false -Dwebdriver.httpclient.version=HTTP_1_1 2025-05-26 18:01:18,431 WARN exited: selenium-node (exit status 1; not expected) 2025-05-26T16:01:18.431276931Z 2025-05-26 18:01:18,431 WARN received SIGINT indicating exit request 2025-05-26T16:01:18.431373533Z 2025-05-26 18:01:18,431 INFO waiting for video-ready, xvfb, vnc, novnc, video-recording to die 2025-05-26 16:01:18,647 [video.recorder] - Display localhost:99.0 is open with dimensions 1920x1080 2025-05-26T16:01:18.647711711Z 2025-05-26 18:01:18,647 INFO success: xvfb entered RUNNING state, process has stayed up for > than 1 seconds (startsecs) 2025-05-26T16:01:18.647721684Z 2025-05-26 18:01:18,647 INFO success: vnc entered RUNNING state, process has stayed up for > than 1 seconds (startsecs) 2025-05-26T16:01:18.647759318Z 2025-05-26 18:01:18,647 INFO success: novnc entered RUNNING state, process has stayed up for > than 1 seconds (startsecs) 2025-05-26 16:01:18,649 [video.recorder] - Waiting for Node endpoint responds 2025-05-26 16:01:18,654 [video.recorder] - Trapped SIGTERM/SIGINT/x so shutting down recorder 2025-05-26 18:01:18,661 WARN received SIGTERM indicating exit request 2025-05-26 16:01:18,662 [video.recorder] - Ready to shutdown the recorder 2025-05-26 16:01:18,663 [video.recorder] - Trapped SIGTERM/SIGINT/x so shutting down recorder 2025-05-26 16:01:18,669 [video.recorder] - Ready to shutdown the recorder 2025-05-26 18:01:18,669 WARN received SIGTERM indicating exit request 2025-05-26 18:01:18,669 INFO stopped: video-recording (exit status 0) 2025-05-26 18:01:19,172 WARN stopped: novnc (terminated by SIGTERM) 2025-05-26 18:01:19,173 WARN stopped: vnc (terminated by SIGTERM) 2025-05-26 18:01:20,175 WARN stopped: xvfb (terminated by SIGTERM) 2025-05-26 18:01:20,176 WARN stopped: video-ready (terminated by SIGKILL)

parholmdahl avatar May 26 '25 16:05 parholmdahl

Actually, DEBUG is not a support value, the right values should be FINE, FINER, FINEST After this https://github.com/SeleniumHQ/selenium/issues/15790, we will have a better hint for the user to acknowledge.

VietND96 avatar May 26 '25 16:05 VietND96

Oh! My bad.. I will try FINEST then..

parholmdahl avatar May 26 '25 16:05 parholmdahl

Tried FINEST now, and it just floooooded the logs (remember, the node is not up for users to use.. It should be "silent"?)...

selenium-vanillanode-invanare-f86f58844-9n72r_selenium-vanillanode-invanare.log

parholmdahl avatar May 26 '25 16:05 parholmdahl

It looks like the /status endpoint gets a new GET request like 2000 times a second? Seems a bit rough?

parholmdahl avatar May 26 '25 16:05 parholmdahl

If 2000 times a second, in this case, how many Nodes are connecting to the Hub at that time?

  • If /status endpoint of Node is hitting a lot, it might come from video recorder container, container health check, heartbeat interval (Hub audit Node status)
  • If /status endpoint of Hub is hitting a lot, it might container health check, or external traffic fetches Hub status for a some tasks

VietND96 avatar May 26 '25 16:05 VietND96

This is on a Node (with image selenium/node-chrome:nightly).. If i change to a node of 4.30 version (selenium/node-chrome:4.30) I don't have the elevated CPU usage, and don't see this many logs in FINEST level.. Same setup, only change image...

I asked Claude AI to compare the two logfiles, and this is their conclusion:

We've identified a significant performance issue with the latest nightly build of selenium/node-chrome when using SE_LOG_LEVEL: "FINEST". The nightly version produces approximately 47x more log output compared to version 4.30, causing severe resource consumption and container crashes. Environment Kubernetes deployment with health checks configured Health check frequency: Every 10 seconds via /status endpoint Additional monitoring: ~1 request per second to /status endpoint Log level: FINEST Comparison Results Version 4.30 (selenium/node-chrome:4.30): Runtime: 46 seconds before termination Total log lines: 1,278 Log frequency: ~28 lines per second Behavior: Stable operation, normal termination Nightly version (selenium/node-chrome:nightly): Runtime: 13 seconds before termination Total log lines: 16,963 Log frequency: ~1,305 lines per second Behavior: Excessive CPU usage (~10% constant), early container crash Analysis The nightly version appears to have either: Significantly more verbose FINEST logging implementation A logging bug causing excessive output Both issues combined Each /status HTTP request in the nightly version generates substantially more debug output compared to version 4.30, creating a feedback loop where the logging itself becomes a resource burden. Impact Container crashes due to resource exhaustion High CPU consumption even when idle Unusable performance with FINEST log level 47x increase in log volume compared to stable version

I think that this is why I see that high CPU usage in later versions (I still run all our nodes on 4.30 because of this)...

parholmdahl avatar May 26 '25 18:05 parholmdahl

I am shooting from the hip here, but could it have something to do with this change: https://github.com/SeleniumHQ/selenium/pull/15448 ?

I have not really checked the code changes (I SUCK at JAVA).. But it seems to handle status checks for more values than before?

parholmdahl avatar May 26 '25 18:05 parholmdahl

That PR tried to add one more attributeregistered, to the status response. But probably it might cause the hitting, since some logic tries to check node.getAvailability() == UP. I will investigate this point.

VietND96 avatar May 26 '25 18:05 VietND96

I tried to use TCP probe only from kubernetes: ## Probes startupProbe: tcpSocket: port: 5555 periodSeconds: 10

And got rid of the other probes.. But i still got high CPU, and when AI controlled the log, it says that it has approx 295 logs per second to the /status endpoint (I am guessing theese come from the hub then?).. compared to 4.30 version that has about 1 per second..

This smells fishy! :)

parholmdahl avatar May 26 '25 18:05 parholmdahl

@VietND96 @parholmdahl After taking a look at the logs that you've provided and reading the whole thread I want to chime in here.

So I think all this polling comes from the video feature that is activated and there are a couple of things that point into that direction in the logs:

  1. The requests seem to come from curl which makes me thing it comes directly from some kind of script instead of the hub user-agent. Java status request:
GET /status HTTP/1.1
Host: 10.57.3.44:5555
user-agent: selenium/4.32.0 (java unix)

'Spammy ' status request:

GET /status HTTP/1.1
Host: localhost:5555
User-Agent: curl/8.11.0
Accept: */* 
  1. The 'spammy' status requests seem to start after the following log line Node endpoint is responding now. Proceeding next steps..., even though you have an initial status request before that.

After doing a quick search through the project I see that the log line highlighted on the second point comes from this script (https://github.com/SeleniumHQ/docker-selenium/blob/bb903724ce9046dfdd7907d2b480237d39ab1587/Video/video.sh). I suck at bash but my guess is that something fishy is happening on that script at line 252 onwards, especially the curl request that is sent there.

You can confirm this by either deactivating the video feature:

      - name: SE_RECORD_VIDEO
        value: "false"

---or more of a longshot on my end, getting rid of the double quotes around the session_id on line 253 since it might be doing some weird string interpolation.

session_id=$(jq -r "${JQ_SESSION_ID_QUERY}" "/tmp/status.json")---

PS: Disregard last part since I don't think that's the actual issue.

Oxilod avatar May 27 '25 17:05 Oxilod

@Oxilod, thank you for your detailed analysis. Actually, one limitation of video recording is relying on /status to trigger start/stop recording per session, and it fetches the status very frequently

poll_interval=${SE_VIDEO_POLL_INTERVAL:-1}

If your test in a session took a few minutes, you can increase this poll interval via this env var SE_VIDEO_POLL_INTERVAL to 10 or 20 (unit in seconds) to see any improvement.

VietND96 avatar May 27 '25 18:05 VietND96

But correct me if I am wrong. You had the video solution already in 4.30 already (I am not having standalone video containers). So this must be a combination of factors, where the new variables on /status is the culprit.

parholmdahl avatar May 27 '25 18:05 parholmdahl

You're right, increasing the poll interval of the video recorder will not be the final resolution for this case, it would be a workaround to evaluate how it helps to improve at this time before having the next release with a possible fix in Grid core.

VietND96 avatar May 27 '25 18:05 VietND96

So, i tried theese last workarounds.

Setting "SE_RECORD_VIDEO" to false, had a good impact on the CPU load (back to 0-0.1 in idle)

Resetting "SE_RECORD_VIDEO" to true, and changing "SE_VIDEO_POLL_INTERVAL" to 20, did not change the CPU load. Then we were back at at least 10% CPU on idle.

This was with latest nightly in "vanilla" mode. "selenium/node-chrome:nightly"

parholmdahl avatar May 28 '25 08:05 parholmdahl

Ah ok, so video recording script do something consumes much CPU, let me try to deliver the improvement soon!

VietND96 avatar May 28 '25 14:05 VietND96

Please check image tag 20250606. I will continue improve this in next Grid core release

VietND96 avatar Jun 07 '25 13:06 VietND96

I deployed the version that you recommended, and it looks right now that the problem is gone.. I let this run throught the day, and see if it persists.

parholmdahl avatar Jun 09 '25 08:06 parholmdahl

I had no more issues after this, i guess we see this fix in 4.34?

parholmdahl avatar Jun 23 '25 11:06 parholmdahl

I am closing this with resolution in #2856

VietND96 avatar Jul 17 '25 17:07 VietND96

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

github-actions[bot] avatar Aug 16 '25 23:08 github-actions[bot]