docker-selenium
[🐛 Bug]: Memory leak on hub and nodes when using google kubernetes (v4.26)
What happened?
After executing some tests with parallelization, the resource baseline increases and never returns to the previous level.
For example: we start the hub and nodes, and average memory use is around 500MB. After running the tests once, resources reach their peak (2GB) and the new baseline settles near 700MB. After the next run the baseline is 800MB, and so on. These numbers are approximate and vary, but the increase in the baseline memory is clearly visible, and it makes the OOMKilled event on the nodes trigger faster, in the middle of a test execution.
On our Google Cloud test automation infrastructure, we use Google Kubernetes Engine (GKE) to host the Selenium Hub and the Selenium Nodes in different pods of the same namespace. There is a single replica of the Selenium Hub and 5 replicas of the Selenium Node (chrome), with 8 max-sessions each (a total of 40 parallel sessions).
We used to run the tests sequentially, and we cannot say whether this problem occurred with that setup, but we remember sometimes having problems and needing to force a restart. Either way, with parallelization enabled the problem is more persistent, and it led us to increase the resources in order to have a better buffer and minimize the occurrences.
The current resource configuration of the chrome nodes is the following:
```yaml
resources:
  limits:
    cpu: "10"
    memory: 2560Mi
  requests:
    cpu: "2"
    memory: 2560Mi
```
The current node configuration (collected from the UI):
OS Arch: amd64
OS Name: Linux
OS Version: 6.1.85+
Total slots: 8
Grid version: 4.26.0 (revision 69f9e5e)
We have identified the selenium-server.jar process as the main source of the resource consumption, as we can verify in the following images.
Before running tests and after a restart:
After running tests:
Questions:
I've read somewhere that in order to minimize this effect we can use the following parameter:
**--drain-after-session-count** drains and shuts down the Node after X sessions have been executed. Useful for environments like Kubernetes; a value higher than zero enables the feature. (A usage sketch follows the questions below.)
- Does it mean that Selenium Grid is not optimized to work with Kubernetes?
- In that case, is it mandatory to use this parameter?
- What are the drawbacks of using it? Will the test executions still be stable?
- If we define the X number of sessions, will the node remain active and keep processing test requests/sessions from the hub when the drain threshold is reached in the middle of a test execution?
- Should we use this approach, or instead restart the nodes at the end of each test execution, or, for example, schedule a restart in the GitLab pipelines every day at a time when it is certain nobody is using the grid?
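For reference, a minimal sketch of how the drain option could be wired into the chart values used below. It assumes the docker-selenium node image honors the SE_DRAIN_AFTER_SESSION_COUNT environment variable, which maps to --drain-after-session-count (if a given image does not support it, the flag could be appended through SE_OPTS instead); the count of 50 is an arbitrary example:

```yaml
chromeNode:
  extraEnvironmentVariables:
    # Existing values from our setup, kept as-is
    - name: SE_NODE_MAX_SESSIONS
      value: "10"
    - name: SE_NODE_MAX_THREADS
      value: "10"
    # Assumption: drain and shut down the node after 50 sessions;
    # Kubernetes then restarts the container, resetting the memory baseline
    - name: SE_DRAIN_AFTER_SESSION_COUNT
      value: "50"
```

As an alternative to draining, a scheduled `kubectl rollout restart deployment/<chrome-node-deployment>` (deployment name depends on the release) triggered from a GitLab scheduled pipeline would achieve the periodic-restart variant mentioned in the last question.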
Command used to start Selenium Grid with Docker (or Kubernetes)

```yaml
global:
  selenium:
    imageTag: 4.0
    nodesImageTag: 4.0

isolateComponents: false

busConfigMap:
  name: selenium-event-bus-config
  annotations: {}

hub:
  enabled: true
  host: ~
  imageName: /selenium/hub
  imageTag: 4.26.0
  imagePullPolicy: IfNotPresent
  annotations: {}
  labels: {}
  publishPort: -  # removed for privacy/security purposes
  subscribePort: -  # removed for privacy/security purposes
  port: -  # removed for privacy/security purposes
  ingress:
    enabled: true
    path: /
    host: "-"  # removed for privacy/security purposes
    annotations:
      kubernetes.io/ingress.class: nginx
    tls:
      enabled: false
      secretName: selenium-hub-tls
  livenessProbe:
    enabled: true
    path: /wd/hub/status
    initialDelaySeconds: 10
    failureThreshold: 10
    timeoutSeconds: 10
    periodSeconds: 10
    successThreshold: 1
  readinessProbe:
    enabled: true
    path: /wd/hub/status
    initialDelaySeconds: 12
    failureThreshold: 10
    timeoutSeconds: 10
    periodSeconds: 10
    successThreshold: 1
  extraEnvironmentVariables:
    - name: SE_DISTRIBUTOR_MAX_THREADS
      value: "50"
    - name: SE_ENABLE_TRACING
      value: "false"
  resources: {}
  serviceType: ClusterIP
  serviceAnnotations: {}
  tolerations: []
  nodeSelector: {}

chromeNode:
  enabled: true
  replicas: 5
  autoscale:
    enabled: false
    minReplicas: 2
    maxReplicas: 5
    pollingInterval: 30
    cooldownPeriod: 300
  imageName: selenium/node-chrome
  imageTag: 130.0
  imagePullPolicy: IfNotPresent
  ports:
    - -  # removed for privacy/security purposes
    - -  # removed for privacy/security purposes
  seleniumPort: -  # removed for privacy/security purposes
  seleniumServicePort: -  # removed for privacy/security purposes
  annotations: {}
  labels: {}
  resources:
    limits:
      cpu: "10"
      memory: 2560Mi
    requests:
      cpu: "2"
      memory: 2560Mi
  tolerations: []
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node_pool
                operator: In
                values:
                  - standard
  antiAffinity: "hard"
  podAffinityPreset: ""
  podAntiAffinityPreset: soft
  nodeAffinityPreset:
    values: []
  extraEnvironmentVariables:
    - name: SE_NODE_MAX_SESSIONS
      value: "10"
    - name: SE_NODE_MAX_THREADS
      value: "10"
  service:
    enabled: true
    type: ClusterIP
    annotations: {}
  terminationGracePeriodSeconds: 300
  dshmVolumeSizeLimit: 1Gi

customLabels: {}
```
Relevant log output
N/A
Operating System
Kubernetes (GKE)
Docker Selenium version (image tag)
4.26.0 (revision 69f9e5e)
Selenium Grid chart version (chart version)
No response
@NicoIodice, thank you for creating this issue. We will troubleshoot it as soon as we can.
@joerg1985, do you have any idea about the memory consumption?
This might be related to the chrome processes not terminating properly, e.g. related to the --no-sandbox arg or another chromedriver issue.
@NicoIodice Could you switch to firefox to confirm this is related to chrome/edge?
As the driver/browser is a child process of the server, the overview in the screenshot might be summing their memory into the server's consumption.
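If it helps with the comparison, a minimal sketch of what enabling a Firefox node could look like in the same values file, assuming the chart exposes a firefoxNode block analogous to chromeNode (replica count and image tag are placeholders):

```yaml
firefoxNode:
  enabled: true
  replicas: 1          # placeholder; scale to match the chrome setup if needed
  imageName: selenium/node-firefox
  imageTag: "130.0"    # placeholder tag; align with the grid version in use
  resources:
    limits:
      memory: 2560Mi   # mirror the chrome node limits for a fair comparison
```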
@joerg1985 Thank you for your answer and suggestions.
Regarding the latter, we will try to configure Firefox so we have values and behavior to compare against.
Regarding the driver, we make sure, and have logs to confirm, that the web driver is correctly quit, and in Dynatrace (the screenshots attached to the issue description) we can validate that the processes are gone afterwards. During the test execution we can clearly see the chrome driver process instances running, but it is guaranteed that they quit correctly at the end of each test scenario.
Additionally, the driver has the --no-sandbox setting. These are the current chrome web driver settings that we use:
```
[DEBUG] [ForkJoinPool-2-worker-23] 26-11-2024 00:27:14.357 PEDLV-28594:: Web Driver options: Capabilities {acceptInsecureCerts: true, browserName: chrome, goog:chromeOptions: {args: [--safebrowsing-disable-down..., --disable-web-security, --disable-gpu, --safebrowsing-disable-exte..., --incognito, --disable-extensions, --headless, --window-size=1920,1080, --allow-running-insecure-co..., --no-sandbox, --disable-search-engine-cho..., --remote-allow-origins=*, --disable-dev-shm-usage, --ignore-certificate-errors], extensions: [], prefs: {download.default_directory: /home/seluser/Downloads/PEMP, download.extensions_to_open: application/pdf, applicatio..., download.prompt_for_download: false, download.prompt_for_download.show_notification: false, plugins.plugins_disabled: [Adobe Flash Player, Chrome PDF Viewer], profile.default_content_setting_values.automatic_downloads: 1, profile.default_content_settings.popups: 0, safebrowsing.enabled: false, safebrowsing_for_trusted_sources_enabled: false}}, se:branch: develop, se:browser: CHROME, se:headless: true, se:module_parallelization: false}
```
Is there any tool to verify whether any instance related to the web driver, or anything connected to it, is still left running?
Chromedriver only spawns chrome processes, so you could search for chrome among the running processes.
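For example, something like the following could be run against a node pod (the pod name is a placeholder):

```bash
# List any chrome/chromedriver processes still alive inside the node pod
# after the driver has been quit
kubectl exec selenium-node-chrome-0 -- ps -ef | grep -i chrom
```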
To confirm it is a Java leak, you could create a memory histogram with jmap before and after running the tests: https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/tooldescr014.html#BABFEDHC
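A minimal sketch of that comparison, assuming jmap is available inside the node image (it ships with a JDK; a JRE-only image would need it added) and using a placeholder pod name:

```bash
# Find the PID of the selenium-server process inside the node pod
kubectl exec selenium-node-chrome-0 -- pgrep -f selenium-server

# Histogram of live objects before the test run...
kubectl exec selenium-node-chrome-0 -- jmap -histo:live <PID> > histo-before.txt
# ...and again after the run; compare the top entries for growth
kubectl exec selenium-node-chrome-0 -- jmap -histo:live <PID> > histo-after.txt
```

A steadily growing class count or byte total between the two histograms would point at a leak inside the JVM; flat histograms with rising pod memory would point back at browser/driver child processes.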