[🐛 Bug]: selectSlot() latency under load due to getLoad() + parallelStream() causes random SessionNotCreatedException
Description
We are experiencing random SessionNotCreatedException in Selenium Grid 4.13 (hub). The problem mostly occurs under high concurrency and is timing-sensitive — it is not reproducible reliably in small-scale or synthetic tests.
Environment
- Hub: Selenium 4.13
- Nodes: 20–50, each with 19 Firefox + 19 Chrome slots (38 slots per node)
- OS: Linux, Java 11
- Browsers: Firefox and Chrome
Exception in client
TimeoutException after three minutes
SEVERE: Exception occurred while doing remote webdriver testing
org.openqa.selenium.SessionNotCreatedException: Could not start a new session. Possible causes are invalid address of the remote server or browser start-up failure.
Host info: host: 'test', ip: 'fe80:0:0:0:0:95e7:1b5d:ae0e%en0'
at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:563)
at org.openqa.selenium.remote.RemoteWebDriver.startSession(RemoteWebDriver.java:245)
at org.openqa.selenium.remote.RemoteWebDriver.<init>(RemoteWebDriver.java:174)
at org.openqa.selenium.remote.RemoteWebDriver.<init>(RemoteWebDriver.java:152)
at com.test.selenium.GridTest.getChromeDriver(GridTest.java:400)
at com.test.selenium.GridTest.doTest(GridTest.java:143)
at com.test.selenium.GridTest.main(GridTest.java:98)
Caused by: org.openqa.selenium.TimeoutException: java.util.concurrent.TimeoutException
Build info: version: '4.23.1', revision: '656257d8e9'
System info: os.name: 'Mac OS X', os.arch: 'aarch64', os.version: '15.6.1', java.version: '11.0.16.1'
Driver info: driver.version: RemoteWebDriver
at org.openqa.selenium.remote.http.jdk.JdkHttpClient.execute0(JdkHttpClient.java:418)
at org.openqa.selenium.remote.http.AddSeleniumUserAgent.lambda$apply$0(AddSeleniumUserAgent.java:42)
at org.openqa.selenium.remote.http.Filter.lambda$andFinally$1(Filter.java:55)
at org.openqa.selenium.remote.http.jdk.JdkHttpClient.execute(JdkHttpClient.java:374)
at org.openqa.selenium.remote.tracing.TracedHttpClient.execute(TracedHttpClient.java:54)
at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:89)
at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:75)
at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:61)
at org.openqa.selenium.remote.HttpCommandExecutor.execute(HttpCommandExecutor.java:162)
at org.openqa.selenium.remote.TracedCommandExecutor.execute(TracedCommandExecutor.java:53)
at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:545)
... 6 more
Caused by: java.util.concurrent.TimeoutException
at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021)
at org.openqa.selenium.remote.http.jdk.JdkHttpClient.execute0(JdkHttpClient.java:401)
... 16 more
Problem details
- Thread dumps show “Local Distributor - Session Creation” threads blocked in DefaultSlotSelector.selectSlot()/NodeStatus.getLoad():
java.lang.Thread.State: BLOCKED (on object monitor)
at java.lang.Object.wait([email protected]/Native Method)
- waiting on <no object reference available>
at java.util.concurrent.ForkJoinTask.externalAwaitDone([email protected]/Unknown Source)
- waiting to re-lock in wait() <0x00000007e5193e50> (a java.util.stream.ReduceOps$ReduceTask)
at java.util.concurrent.ForkJoinTask.doInvoke([email protected]/Unknown Source)
at java.util.concurrent.ForkJoinTask.invoke([email protected]/Unknown Source)
at java.util.stream.ReduceOps$ReduceOp.evaluateParallel([email protected]/Unknown Source)
at java.util.stream.ReduceOps$5.evaluateParallel([email protected]/Unknown Source)
at java.util.stream.ReduceOps$5.evaluateParallel([email protected]/Unknown Source)
at java.util.stream.AbstractPipeline.evaluate([email protected]/Unknown Source)
at java.util.stream.ReferencePipeline.count([email protected]/Unknown Source)
at org.openqa.selenium.grid.data.NodeStatus.getLoad(NodeStatus.java:174)
at org.openqa.selenium.grid.distributor.selector.DefaultSlotSelector.lambda$selectSlot$1(DefaultSlotSelector.java:75)
at org.openqa.selenium.grid.distributor.selector.DefaultSlotSelector$$Lambda$1141/0x0000000800562040.applyAsDouble(Unknown Source)
at java.util.Comparator.lambda$comparingDouble$8dcf42ea$1([email protected]/Unknown Source)
at java.util.Comparator$$Lambda$1142/0x0000000800562440.compare([email protected]/Unknown Source)
at java.util.Comparator.lambda$thenComparing$36697e65$1([email protected]/Unknown Source)
at java.util.Comparator$$Lambda$1143/0x0000000800562840.compare([email protected]/Unknown Source)
at java.util.Comparator.lambda$thenComparing$36697e65$1([email protected]/Unknown Source)
at java.util.Comparator$$Lambda$1143/0x0000000800562840.compare([email protected]/Unknown Source)
at java.util.Comparator.lambda$thenComparing$36697e65$1([email protected]/Unknown Source)
at java.util.Comparator$$Lambda$1143/0x0000000800562840.compare([email protected]/Unknown Source)
at java.util.TimSort.binarySort([email protected]/Unknown Source)
at java.util.TimSort.sort([email protected]/Unknown Source)
at java.util.Arrays.sort([email protected]/Unknown Source)
at java.util.ArrayList.sort([email protected]/Unknown Source)
at java.util.stream.SortedOps$RefSortingSink.end([email protected]/Unknown Source)
at java.util.stream.Sink$ChainedReference.end([email protected]/Unknown Source)
at java.util.stream.AbstractPipeline.copyInto([email protected]/Unknown Source)
at java.util.stream.AbstractPipeline.wrapAndCopyInto([email protected]/Unknown Source)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential([email protected]/Unknown Source)
at java.util.stream.AbstractPipeline.evaluate([email protected]/Unknown Source)
at java.util.stream.ReferencePipeline.collect([email protected]/Unknown Source)
at org.openqa.selenium.grid.distributor.selector.DefaultSlotSelector.selectSlot(DefaultSlotSelector.java:86)
at org.openqa.selenium.grid.distributor.local.LocalDistributor.reserveSlot(LocalDistributor.java:669)
at org.openqa.selenium.grid.distributor.local.LocalDistributor.newSession(LocalDistributor.java:551)
at org.openqa.selenium.grid.distributor.local.LocalDistributor$NewSessionRunnable.handleNewSessionRequest(LocalDistributor.java:829)
at org.openqa.selenium.grid.distributor.local.LocalDistributor$NewSessionRunnable.lambda$run$1(LocalDistributor.java:787)
at org.openqa.selenium.grid.distributor.local.LocalDistributor$NewSessionRunnable$$Lambda$1125/0x0000000800552840.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/Unknown Source)
at java.lang.Thread.run([email protected]/Unknown Source)
[tdump1.txt](https://github.com/user-attachments/files/22044863/tdump1.txt)
[tdump2.txt](https://github.com/user-attachments/files/22044864/tdump2.txt)
[tdump3.txt](https://github.com/user-attachments/files/22044865/tdump3.txt)
- Observations:
  - The write lock on LocalDistributor is held during reserveSlot(), which invokes selectSlot().
  - NodeStatus.getLoad() uses a parallelStream(), which creates many ForkJoinPool tasks.
  - This does not appear to be a deadlock or hard blocking, but rather latency/slow execution under load.
  - Other session requests queue up behind the write lock, leading to random SessionNotCreatedException when delays exceed timeouts.
- Thread dump behavior:
  - Different dumps show different threads inside selectSlot() or ForkJoinPool tasks.
  - This confirms the issue is contention/latency, not a deadlock.
- Probable cause:
  - selectSlot() performs O(N log N) comparator calls during node sorting.
  - Each comparator calls getLoad(), which scans all slots via parallelStream(), creating thousands of ForkJoinPool tasks per session request.
  - This repeated scanning is likely the main source of latency.
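To illustrate the suspected cost pattern, here is a minimal, self-contained sketch. It does not use Selenium's classes: FakeNode, buildNodes, sortRecomputing and sortPrecomputed are hypothetical names invented for this example, and the slot scan is a plain loop rather than a parallelStream(). It only shows that a comparator which recomputes the load is invoked many more times than once per node, and that caching the load before sorting gives the same ordering with one scan per node.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

public class LoadSortSketch {

  /** Hypothetical stand-in for a Grid node; this is NOT Selenium's NodeStatus. */
  static final class FakeNode {
    private final String id;
    private final boolean[] slotBusy;
    private final AtomicInteger loadCalls;

    FakeNode(String id, boolean[] slotBusy, AtomicInteger loadCalls) {
      this.id = id;
      this.slotBusy = slotBusy;
      this.loadCalls = loadCalls;
    }

    String id() {
      return id;
    }

    /** Scans every slot on each call and counts how often it is invoked. */
    double getLoad() {
      loadCalls.incrementAndGet();
      int busy = 0;
      for (boolean b : slotBusy) {
        if (b) {
          busy++;
        }
      }
      return 100.0 * busy / slotBusy.length;
    }
  }

  /** Variant A: the comparator recomputes the load on every comparison. */
  static List<FakeNode> sortRecomputing(List<FakeNode> nodes) {
    return nodes.stream()
        .sorted(Comparator.comparingDouble(FakeNode::getLoad).thenComparing(FakeNode::id))
        .collect(Collectors.toList());
  }

  /** Variant B: compute each node's load once, then sort on the cached value. */
  static List<FakeNode> sortPrecomputed(List<FakeNode> nodes) {
    Map<FakeNode, Double> load = new IdentityHashMap<>();
    for (FakeNode node : nodes) {
      load.put(node, node.getLoad());
    }
    return nodes.stream()
        .sorted(
            Comparator.comparingDouble((FakeNode node) -> load.get(node))
                .thenComparing(FakeNode::id))
        .collect(Collectors.toList());
  }

  /** Builds nodeCount nodes with slotsPerNode slots each and a deterministic busy pattern. */
  static List<FakeNode> buildNodes(int nodeCount, int slotsPerNode, AtomicInteger loadCalls) {
    List<FakeNode> nodes = new ArrayList<>();
    for (int n = 0; n < nodeCount; n++) {
      boolean[] busy = new boolean[slotsPerNode];
      int busyCount = (n * 7) % slotsPerNode;
      for (int s = 0; s < busyCount; s++) {
        busy[s] = true;
      }
      nodes.add(new FakeNode("node-" + n, busy, loadCalls));
    }
    return nodes;
  }

  public static void main(String[] args) {
    AtomicInteger calls = new AtomicInteger();
    List<FakeNode> nodes = buildNodes(50, 38, calls);
    sortRecomputing(nodes);
    System.out.println("getLoad() calls, recomputing: " + calls.getAndSet(0));
    sortPrecomputed(nodes);
    System.out.println("getLoad() calls, precomputed: " + calls.get());
  }
}
```

With 50 nodes of 38 slots each, the precomputed variant calls getLoad() exactly once per node, while the recomputing variant calls it twice per comparison. Whether this difference matters in the real Grid depends on how expensive each scan actually is there, which this sketch does not measure.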
Impact
- Random SessionNotCreatedException under load.
- Not reproducible under light or synthetic load.
Reproducible Code
Not easily reproducible
Debugging Logs
Find the attached thread dumps
@Zanis7, thank you for creating this issue. We will troubleshoot it as soon as we can.
⚠️ You reported using Selenium version 4.13, but the latest release is 4.35.
Please verify that this issue still occurs with the latest version. If it no longer applies, you can close this issue or update your comment.
This issue will be marked "awaiting answer" and may be closed automatically if no response is received.
The getLoad() and selectSlot() code is mostly the same in 4.13 and the latest version, so the issue should occur in the latest version as well.
We still need better input in order to reproduce this.
We are facing SessionNotCreatedException randomly when running a high number of sessions. The exception is caused by a TimeoutException (you can see the full stack trace in my post).
Setup Details:
- 25 nodes, each capable of handling 19 sessions (Firefox and Chrome).
- In total, the hub can handle up to 475 sessions.
- Nodes are added dynamically to the hub based on demand, and removed when the load decreases.
The issue occurs randomly whenever there are a large number of concurrent session creation requests in a short period (around 200 per minute). Recently, we even faced the issue with a lower session creation rate (about 50 per minute), but at that time there were already 200+ active sessions running.
I want to reiterate that this issue is not easily reproducible. Based on thread dumps, I assumed it could be related to a lock.
How are the CPU and RAM used when this issue occurs? Are you measuring that?
When you have many browsers open, the machine will take longer to open a new one, which makes the whole process of reserving and allocating longer. Our recommendation is usually to have smaller Nodes and more of them.
Yes, we are monitoring CPU and RAM usage. On the Hub, both CPU and RAM usage remain below 50%. On the Nodes, CPU and RAM usage are around 80%.
> Our recommendation is usually to have smaller Nodes and more of them.
I understand that. However, to reduce the number of calls to our DNS servers, we use a global disk cache. That’s why we try to run more browsers per Node. From what I can see, the Hub did not pass the request to the Node, so it doesn’t appear to be a browser startup delay.
Is there any blocker to using your tests to quickly benchmark the Hub on Selenium 4.35?
With 4.35, when starting the Hub, you can set --slot-selector org.openqa.selenium.grid.distributor.selector.GreedySlotSelector (another built-in class that implements selectSlot()) to see whether its time complexity is lower than the default's.
In addition to that, are you continuously measuring CPU and RAM? Do you know what the usage was when the issue happened? I ask because when a browser opens, there is a peak in CPU and RAM that lasts less than a second (or, under high concurrency, a few seconds). Maybe you can correlate those two.
@VietND96 yes, we can run our tests with GreedySlotSelector.
We currently have a CustomSlotSelector that is very similar to the GreedySlotSelector; it also prefers nodes with higher load. Please find the code below. We will run tests with additional logging in our CustomSlotSelector to measure the time, and we will run with GreedySlotSelector as well.
Set<SlotId> slotIds =
    nodes.stream()
        .filter(node -> node.hasCapacity(capabilities, slotMatcher))
        .sorted(
            Comparator.comparingLong(this::getNumberOfSupportedBrowsers)
                // Then prefer the node with the highest load (load is negated, so descending)
                .thenComparingDouble(node -> -node.getLoad())
                // Then by last session created (ascending, so the oldest timestamp comes first)
                .thenComparingLong(node -> node.getLastSessionCreated())
                // And use the node id as a tie-breaker.
                .thenComparing(NodeStatus::getNodeId))
        .flatMap(
            node ->
                node.getSlots().stream()
                    .filter(slot -> slot.getSession() == null)
                    .filter(slot -> slot.isSupporting(capabilities, slotMatcher))
                    .map(Slot::getId))
        .collect(toImmutableSet());
return slotIds;
@diemol We are collecting CPU and memory usage every 5 seconds. Both CPU and RAM usage remained below 50% at the time of the issue, with no CPU or memory spikes.
We ran the tests with loggers that measure the time taken by selectSlot. The logs show that the method usually takes less than 10 ms and rarely exceeds 50 ms, which indicates that getLoad is not the problem. I went through the thread dumps again and found that 300+ threads were waiting in LocalNewSessionQueue.addToQueue. As I understand it, they were waiting for Distributor threads to pick up the submitted sessions. We have around 72 threads for session creation, which I confirmed from the thread dumps by counting the threads named 'Local Distributor - Session Creation'. So even if each thread takes 10 seconds to create a session, 360+ sessions can be created in under 60 seconds. Next, I am planning to measure the entire newSession process by patching Selenium Grid. Do you have any suggestions for what I should look at next?
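For reference, the kind of measurement described above can be sketched as follows. This is only an illustration under assumptions: TimingSketch, timed, countSlowerThanMillis and maxSessionsPerWindow are hypothetical names, not Selenium APIs, and the lambda in main is a placeholder for the real call being measured. The last helper reproduces the throughput estimate with the numbers from this comment (72 threads, 10 seconds per session, 60-second window).

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Supplier;

public class TimingSketch {
  // Elapsed times in nanoseconds; a concurrent queue because many threads may record at once.
  private final Queue<Long> samplesNanos = new ConcurrentLinkedQueue<>();

  /** Runs the call, records how long it took, and returns its result. */
  <T> T timed(Supplier<T> call) {
    long start = System.nanoTime();
    try {
      return call.get();
    } finally {
      samplesNanos.add(System.nanoTime() - start);
    }
  }

  /** Number of calls recorded so far. */
  int sampleCount() {
    return samplesNanos.size();
  }

  /** Counts recorded calls that took longer than the given threshold. */
  long countSlowerThanMillis(long thresholdMillis) {
    long thresholdNanos = thresholdMillis * 1_000_000L;
    return samplesNanos.stream().filter(n -> n > thresholdNanos).count();
  }

  /** Upper bound on sessions created in a window if every worker thread stays busy. */
  static long maxSessionsPerWindow(int threads, long secondsPerSession, long windowSeconds) {
    return threads * (windowSeconds / secondsPerSession);
  }

  public static void main(String[] args) {
    TimingSketch timing = new TimingSketch();
    // Hypothetical placeholder for the real call, e.g. a selectSlot(...) invocation.
    int result = timing.timed(() -> 41 + 1);
    System.out.println("result=" + result + ", samples=" + timing.sampleCount());
    System.out.println("calls slower than 50 ms: " + timing.countSlowerThanMillis(50));
    System.out.println("max sessions in 60 s: " + maxSessionsPerWindow(72, 10, 60));
  }
}
```

With 72 threads each completing one session every 10 seconds, the bound works out to 72 × 6 = 432 sessions per minute, which is consistent with the 360+ estimate. Wrapping the whole newSession path in the same way would show where the remaining time goes.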
@Zanis7 you should first try updating to a recent version of Selenium, as there have been some fixes in this area. One I can remember is https://github.com/SeleniumHQ/selenium/commit/6cda69299366dd9e0976bede2fdbcfe42eb90cfa, fixed in 4.20.0.