docker-selenium
[🐛 Bug]: Each test-execution starts multiple jobs
What happened?
When I start Selenium tests using the Grid, two jobs are always started.
One is started immediately. Once it is up and running, a second one is scheduled. The second job is then used for the test. After the test is done, only the second one finishes. The other keeps running (doing nothing). Today I stopped one that had been in the running state the whole weekend.
The second job is only started as soon as the first is ready. I noticed this when the first was scheduled on a node that did not have the image yet. It took about 2:30 minutes to pull it, and only after that was done did the second job get scheduled. At first I thought this might have something to do with a timeout because pulling the image took too long, but it also happens when the image is available and the first job only takes seconds to get ready.
Command used to start Selenium Grid with Docker
I installed the Grid from the Helm chart using an existing KEDA installation.
selenium-grid:
  ingress:
    enabled: true
  [ ... ]
  hub:
    extraEnvironmentVariables:
      - name: TZ
        value: Europe/Berlin
    resources:
      limits:
        memory: 2Gi
      requests:
        cpu: 50m
        memory: 2Gi
  autoscaling:
    enableWithExistingKEDA: true
    scalingType: job
  chromeNode:
    enabled: true
    maxReplicaCount: 16
    extraEnvironmentVariables:
      - name: TZ
        value: Europe/Berlin
  firefoxNode:
    enabled: true
    maxReplicaCount: 8
    extraEnvironmentVariables:
      - name: TZ
        value: Europe/Berlin
  edgeNode:
    enabled: false
My Kubernetes cluster is on version 1.23.
Relevant log output
I only included the KEDA log in the form, as I could not see any interesting output in the Grid logs.
2023-07-30T12:56:19Z INFO scaleexecutor Scaling Jobs {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 0}
2023-07-30T12:56:19Z INFO scaleexecutor Scaling Jobs {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:19Z INFO scaleexecutor Scaling Jobs {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 0}
2023-07-30T12:56:19Z INFO scaleexecutor Scaling Jobs {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:19Z INFO scaleexecutor Creating jobs {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Effective number of max jobs": 1}
2023-07-30T12:56:19Z INFO scaleexecutor Creating jobs {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of jobs": 1}
2023-07-30T12:56:19Z INFO scaleexecutor Created jobs {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of jobs": 1}
2023-07-30T12:56:29Z INFO scaleexecutor Scaling Jobs {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 0}
2023-07-30T12:56:29Z INFO scaleexecutor Scaling Jobs {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:29Z INFO scaleexecutor Scaling Jobs {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 1}
2023-07-30T12:56:29Z INFO scaleexecutor Scaling Jobs {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:29Z INFO scaleexecutor Creating jobs {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Effective number of max jobs": 1}
2023-07-30T12:56:29Z INFO scaleexecutor Creating jobs {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of jobs": 1}
2023-07-30T12:56:29Z INFO scaleexecutor Created jobs {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of jobs": 1}
2023-07-30T12:56:39Z INFO scaleexecutor Scaling Jobs {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 2}
2023-07-30T12:56:39Z INFO scaleexecutor Scaling Jobs {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:39Z INFO scaleexecutor Scaling Jobs {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 0}
2023-07-30T12:56:39Z INFO scaleexecutor Scaling Jobs {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:49Z INFO scaleexecutor Scaling Jobs {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 0}
2023-07-30T12:56:49Z INFO scaleexecutor Scaling Jobs {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:49Z INFO scaleexecutor Scaling Jobs {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 2}
2023-07-30T12:56:49Z INFO scaleexecutor Scaling Jobs {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
Operating System
Kubernetes 1.23 on Flatcar Linux
Docker Selenium version (tag)
4.10.0-20230607
@maxnitze, thank you for creating this issue. We will troubleshoot it as soon as we can.
Info for maintainers
Triage this issue by using labels.
If information is missing, add a helpful comment and then the I-issue-template label.
If the issue is a question, add the I-question label.
If the issue is valid but there is no time to troubleshoot it, consider adding the help wanted label.
If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C),
add the applicable G-* label, and it will provide the correct link and auto-close the
issue.
After troubleshooting the issue, please add the R-awaiting answer label.
Thank you!
Can you share the test script you are using to see this behavior?
Hey @diemol ,
I asked the KEDA project as well, and it seems the issue is with the scalingStrategy. When I set it to default, it works.
See here: https://github.com/kedacore/keda/issues/4833
Is there any specific reason the default is set to accurate in this Chart? In the issue @JurTurFer mentioned:
I don't think that you will have any trouble with the change. TBH, IDK why they set
accurate. We suggest using accurate only in the case of knowing that the job is completed just at the end and not in the meantime. Docs explain how they work (a bit below) but the main difference is how both strategies take into account the current jobs.
https://github.com/kedacore/keda/issues/4833#issuecomment-1658887078
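For anyone wanting to try the same workaround, a minimal values override along these lines worked for me. I am assuming the chart exposes the ScaledJob scaling strategy under autoscaling.scaledJobOptions (newer chart versions do); please verify the key path against your chart's values.yaml, as older versions may require patching the ScaledJob template instead.

autoscaling:
  enableWithExistingKEDA: true
  scalingType: job
  # Assumed key path; check the chart's values.yaml for your version.
  scaledJobOptions:
    scalingStrategy:
      strategy: default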
Can you share the test script you are using to see this behavior?
To answer your question: I have Geb tests for some of my applications. To connect to the Grid I use the RemoteWebDriver from org.seleniumhq.selenium:selenium-remote-driver:3.141.59.
Hey @diemol ,
I asked the KEDA project as well, and it seems the issue is with the scalingStrategy. When I set it to default, it works. See here: kedacore/keda#4833
Is there any specific reason the default is set to accurate in this Chart? In the issue @JurTurFer mentioned:
I don't think that you will have any trouble with the change. TBH, IDK why they set accurate. We suggest using accurate only in the case of knowing that the job is completed just at the end and not in the meantime. Docs explain how they work (a bit below) but the main difference is how both strategies take into account the current jobs.
@msvticket do you know?
For reference: It was set to accurate right from the beginning: https://github.com/SeleniumHQ/docker-selenium/commit/f0bbfe02c318ac58b8875f8f26c607ca86b9cf42
I could not find any discussion about the strategy in the PR.
I am seeing similar behavior with scalingType: deployment. Kubernetes version is 1.23 for me as well. Observation: the wdio framework gets a 504 gateway timeout error. A session is started on a node but the browser does nothing. A few sessions are shown as pending in the queue as well.
I will try the following and share results:
- Increase the timeout on the ingress.
- Increase the default connect timeout in the wdio framework.
We currently experience a problem with the default strategy as well: it expects sessions to stay in the queue while they are being worked on. The calculation for the scaled jobs basically checks whether more jobs are running than there are entries in the queue, and if that is the case, no new job is scheduled.
Maybe that is what the accurate strategy was meant to fix? We are currently checking if and how we can implement a custom strategy instead.
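For reference, KEDA's ScaledJob spec also offers a custom strategy with two tuning knobs; below is a rough sketch of what we are evaluating. The field names come from the KEDA scaling-jobs docs, the values are untested placeholders, and in this chart the block would presumably have to be passed through to the generated ScaledJob.

scalingStrategy:
  strategy: "custom"
  # Both parameters below are illustrative starting points, not tested values.
  customScalingQueueLengthDeduction: 1
  customScalingRunningJobPercentage: "0.5"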
I am seeing similar behavior with scalingType: deployment. Kubernetes version is 1.23 for me as well. Observation: the wdio framework gets a 504 gateway timeout error. A session is started on a node but the browser does nothing. A few sessions are shown as pending in the queue as well.
I will try the following and share results:
- Increase the timeout on the ingress.
- Increase the default connect timeout in the wdio framework.
Update with scalingType: deployment: we have seen improvements after increasing the timeouts in the ingress. The pending sessions are not there anymore.
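In case it helps others, the change was roughly the following. This assumes the chart forwards ingress.annotations to an NGINX ingress controller; the annotation names differ for other controllers, and the 300-second values are just what we picked.

ingress:
  enabled: true
  annotations:
    # Example timeouts in seconds; tune to your environment.
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"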
Hey @diemol ,
I asked the KEDA project as well, and it seems the issue is with the scalingStrategy. When I set it to default, it works. See here: kedacore/keda#4833
Your mileage may vary, apparently. For me it worked much better with accurate; the scale-up was way too slow with default. I suppose it depends on your priorities: if you want a fast scaling response, choose accurate; if you want to be sure you don't end up with too many pods, choose default.
That might be another issue (we did not have issues with too slow startup, though).
A bigger problem is the calculation of the scaling itself. I dug deeper into the KEDA code and found out that the default strategy assumes that "locked messages" (i.e. the ones that are already in progress) stay in the queue, which is not the case in the Selenium Grid. This leads to the issue that new sessions are only started once the queue length exceeds the number of currently running jobs.
This issue is exactly what the accurate strategy solves:
If the scaler returns queueLength (number of items in the queue) that does not include the number of locked messages, this strategy is recommended.
see https://keda.sh/docs/2.11/concepts/scaling-jobs/
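To make the difference concrete (based on my reading of the KEDA docs, so take the exact formulas with a grain of salt): say one job is already running a session and one new session is waiting in the Grid's queue. Because the Grid removes a session from the queue once it is assigned to a node, the scaler sees a queue length of 1 and 1 running job. The default strategy roughly computes new jobs as queue length minus running jobs, i.e. 1 - 1 = 0, so the waiting session starves until the queue outgrows the running jobs. The accurate strategy only subtracts pending (not yet working) jobs, i.e. 1 - 0 = 1, and starts exactly the job that is needed.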
I suppose it depends on your priorities: if you want a fast scaling response, choose accurate; if you want to be sure you don't end up with too many pods, choose default.
The issue was not only that we started too many pods, but rather that additional jobs were started which never finished. I had this in a test setup with only a single session, though. I'm not sure whether another session might later be taken over by the additional job. Do you have any experience there?
I suppose it depends on your priorities: if you want a fast scaling response, choose accurate; if you want to be sure you don't end up with too many pods, choose default.
The issue was not only that we started too many pods, but rather that additional jobs were started which never finished.
Which is the same thing.
I had this in a test setup with only a single session, though. I'm not sure whether another session might later be taken over by the additional job. Do you have any experience there?
Yes it would.
That might be another issue (we did not have issues with too slow startup, though).
A bigger problem is the calculation of the scaling itself. I dug deeper into the KEDA code and found out that the default strategy assumes that "locked messages" (i.e. the ones that are already in progress) stay in the queue, which is not the case in the Selenium Grid. This leads to the issue that new sessions are only started once the queue length exceeds the number of currently running jobs.
This issue is exactly what the accurate strategy solves:
Exactly. That is why I chose accurate as the default strategy in the chart.
I have been experimenting with both types of scaling (job/deployment) and am seeing multiple jobs getting triggered. On one occasion it started 16 jobs for just two test cases. For now I am sticking with deployment and will wait to hear more from others on this behavior. I tried KEDA 2.12.0 as well.
Any update on this one? We also started having this issue after upgrading KEDA from 2.11.1 to 2.12.0.
Fortunately (or unfortunately for you) we don't have the problem anymore. It was happening when we had a test setup that only one application used at a time. When we scaled this up, it just went away. We are running hundreds of jobs daily now with no issues with "extra spawned jobs" so far.
Sorry that I cannot be of more help :/
See also my comment here: https://github.com/kedacore/keda/issues/4833#issuecomment-1793442346