selenium icon indicating copy to clipboard operation
selenium copied to clipboard

[🐛 Bug]: Under high load, timed out sessions are not removed from session queue

Open bhecquet opened this issue 1 year ago • 21 comments

What happened?

Hello,

on our setup (a hub with 15 nodes), under high load, we see that the hub tries to create sessions that are already timed out.

I can reproduce this with a rather smaller setup:

  • 1 hub, 5 nodes allowing each at most 2 sessions
  • Hub is configured with a --session-request-timeout 30 --session-retry-interval 1000 --reject-unsupported-caps true Hub is running on a small server (1 CPU)
  • 1 TestNG test that runs 60 tests with a parallelism of 15 (so more than the grid can handle) This test is very simple: it starts a browser, fill 2 text fields and click on 1 button. If necessary I will provide a pure Selenium script (for now, it's using our framework, but I'm pretty sure the problem is not there
  • Wait a bit (may take 15 to 30 mins to happen) and at a certain point, no more sessions are created

Expected results Grid continues to create session according to its capacity

Current result: No more session created: this is due to #12848 (which is a good point) which kills the browser immediately if the session is timed out At this point, client (see the logs) receives the message "New session request timed out " as expected

Reading the code in LocalNewSessionQueue, I can't see why this can happen because there are 2 guards

  • timeoutSessions (https://github.com/SeleniumHQ/selenium/blob/49214cd40467c2926d1b450d86dde9363bd0acd3/java/src/org/openqa/selenium/grid/sessionqueue/local/LocalNewSessionQueue.java#L160) which should remove such orphan / timed out sessions every 10 secs
  • addToQueue mechanism https://github.com/SeleniumHQ/selenium/blob/49214cd40467c2926d1b450d86dde9363bd0acd3/java/src/org/openqa/selenium/grid/sessionqueue/local/LocalNewSessionQueue.java#L235C1-L240C8 which removes session request from queue in any case

I've added some logs to grid and to capabilities to follow the session request flow

Looking at the logs, I can see 2 problems:

  • timeoutSessions method does not seem to run regularly. There are some gaps between executions. As it's a standard JVM feature, this gaps may be due to the method taking more time to execute
  • Probable source of problem: looking at the logs, we can see that following the "(17:02:39,917) addToQueue", a "(17:02:40,783) getNextAvailable" is called, bringing the session request to the LocalDistributor. Here, the session request is not in the queue anymore so it cannot be timed out But, (17:04:18,884) handleNewSessionRequest is done very late, I suppose because there is queueing in sessionCreatorExecutor

Do you confirm my analysis ?

A workaround would be to increase the CPU number / or thread pool of the executor, but this would only be a workaround

I'll try to imagine a correction, but don't hesitate to suggest ideas

Bertrand

How can we reproduce the issue?

See description above

Relevant log output

############## HUB ############################
Logs are filtered for readability. I've kept only logs relative to a specific session request and the "timeoutSessions" logs


 INFO  2024-03-22 17:02:21,772 [Local New Session Queue] LocalNewSessionQueue: timeout sessions
...
INFO  2024-03-22 17:02:39,917 [pool-2-thread-64] LocalNewSessionQueue: addToQueue 7fe2a2fd-0d41-446e-ad3b-1c992677012d date: [Capabilities {acceptInsecureCerts: true, browserName: chrome, goog:chromeOptions: {args: [--disable-translate, --disable-web-security, --no-sandbox, --disable-site-isolation-tr..., --disable-features=IsolateO..., --remote-allow-origins=*], extensions: [], prefs: {profile.exit_type: Normal}}, pageLoadStrategy: normal, platformName: any, proxy: {proxyType: autodetect}, sr:beta: false, sr:creationRequestTime: 2024-03-22T17:02:39.8291502, sr:testName: loginInvalidPerf15, sr:try: 15}]
 INFO  2024-03-22 17:02:40,783 [Local Distributor - New Session Queue] LocalNewSessionQueue: getNextAvailable
 INFO  2024-03-22 17:02:40,783 [Local Distributor - New Session Queue] LocalNewSessionQueue: remove 7fe2a2fd-0d41-446e-ad3b-1c992677012d
 INFO  2024-03-22 17:02:40,783 [Local Distributor - New Session Queue] LocalNewSessionQueue: end remove 7fe2a2fd-0d41-446e-ad3b-1c992677012d
...

 INFO  2024-03-22 17:02:31,732 [Local New Session Queue] LocalNewSessionQueue: timeout sessions
 INFO  2024-03-22 17:02:41,735 [Local New Session Queue] LocalNewSessionQueue: timeout sessions
 INFO  2024-03-22 17:02:51,742 [Local New Session Queue] LocalNewSessionQueue: timeout sessions
 INFO  2024-03-22 17:03:01,750 [Local New Session Queue] LocalNewSessionQueue: timeout sessions
 INFO  2024-03-22 17:03:09,919 [pool-2-thread-64] LocalNewSessionQueue: end addToQueue 7fe2a2fd-0d41-446e-ad3b-1c992677012d
 INFO  2024-03-22 17:03:11,756 [Local New Session Queue] LocalNewSessionQueue: timeout sessions
 INFO  2024-03-22 17:03:21,772 [Local New Session Queue] LocalNewSessionQueue: timeout sessions
 INFO  2024-03-22 17:03:31,773 [Local New Session Queue] LocalNewSessionQueue: timeout sessions
...
... (at this point, client has received "New session request timed out" error, see below)
...
INFO  2024-03-22 17:04:18,884 [Local Distributor - Session Creation] LocalDistributor$NewSessionRunnable: handleNewSessionRequest 7fe2a2fd-0d41-446e-ad3b-1c992677012d
 INFO  2024-03-22 17:04:18,884 [Local Distributor - Session Creation] LocalDistributor: newSession 7fe2a2fd-0d41-446e-ad3b-1c992677012d
 INFO  2024-03-22 17:04:18,884 [Local Distributor - Session Creation] Sys$Out: 17:04:18.884 INFO [LocalDistributor.newSession_aroundBody0] - Session request received by the Distributor: 
 [Capabilities {acceptInsecureCerts: true, browserName: chrome, goog:chromeOptions: {args: [--disable-translate, --disable-web-security, --no-sandbox, --disable-site-isolation-tr..., --disable-features=IsolateO..., --remote-allow-origins=*], extensions: [], prefs: {profile.exit_type: Normal}}, pageLoadStrategy: normal, platformName: any, proxy: {proxyType: autodetect}, sr:beta: false, sr:creationRequestTime: 2024-03-22T17:02:39.8291502, sr:testName: loginInvalidPerf15, sr:try: 15}]
...
INFO  2024-03-22 17:04:22,887 [Local Distributor - Session Creation] Sys$Out: 17:04:22.887 INFO [LocalDistributor.newSession_aroundBody0] - Session created by the Distributor. Id: 6bb90d1c0268b036743cc04aea1680a4 
 Caps: Capabilities {acceptInsecureCerts: true, browserName: chrome, browserVersion: 120.0.6099.130, chrome: {chromedriverVersion: 120.0.6099.109 (3419140ab66..., userDataDir: C:\Users\ZSJP_C~1\AppData\L...}, fedcm:accounts: true, goog:chromeOptions: {debuggerAddress: localhost:53256}, networkConnectionEnabled: false, pageLoadStrategy: normal, platformName: any, proxy: {proxyType: autodetect}, se:bidiEnabled: false, se:cdp: ws://selenium-grid-hub2.com..., se:cdpVersion: 120.0.6099.130, setWindowRect: true, sr:beta: false, sr:creationRequestTime: 2024-03-22T17:02:39.8291502, sr:testName: loginInvalidPerf15, sr:try: 15, strictFileInteractability: false, timeouts: {implicit: 0, pageLoad: 300000, script: 30000}, unhandledPromptBehavior: dismiss and notify, webauthn:extension:credBlob: true, webauthn:extension:largeBlob: true, webauthn:extension:minPinLength: true, webauthn:extension:prf: true, webauthn:virtualAuthenticators: true}
 INFO  2024-03-22 17:04:23,069 [Event Bus Listener Notifier] Sys$Out: 17:04:23.068 INFO [GridModel.release] - Releasing slot for session id 6bb90d1c0268b036743cc04aea1680a4
 INFO  2024-03-22 17:04:23,069 [Event Bus Listener Notifier] Sys$Out: 17:04:23.068 INFO [LocalSessionMap.lambda$new$0] - Deleted session from local Session Map, Id: 6bb90d1c0268b036743cc04aea1680a4

############## Client #########################
WARN  2024-03-22 17:03:09,931 [TestNG-methods-15] SeleniumGridDriverFactory: Error creating driver on hub http://selenium-grid-hub.company:4444/wd/hub: Could not start a new session. Response code 500. Message: Could not start a new session. New session request timed out 
Host info: host: 'xxx', ip: 'a.b.c.d'
Build info: version: '4.16.1', revision: '9b4c83354e'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '3.10.0-1160.105.1.el7.x86_64', java.version: '21.0.1'
Driver info: driver.version: unknown
Build info: version: '4.11.0', revision: '040bc5406b'
System info: os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '17.0.6'
Driver info: org.openqa.selenium.remote.RemoteWebDriver
Command: [null, newSession {capabilities=[Capabilities {acceptInsecureCerts: true, browserName: chrome, goog:chromeOptions: {args: [--disable-translate, --disable-web-security, --no-sandbox, --disable-site-isolation-tr..., --disable-features=IsolateO..., --remote-allow-origins=*], extensions: [], prefs: {profile.exit_type: Normal}}, pageLoadStrategy: normal, platformName: any, proxy: Proxy(autodetect), sr:beta: false, sr:creationRequestTime: 2024-03-22T17:02:39.8291502, sr:testName: loginInvalidPerf15, sr:try: 15}]}]

Operating System

Linux

Selenium version

4.11.0

What are the browser(s) and version(s) where you see this issue?

Chrome / not related to browser

What are the browser driver(s) and version(s) where you see this issue?

not related to browser

Are you using Selenium Grid?

4.16.1

bhecquet avatar Mar 22 '24 17:03 bhecquet

@bhecquet, thank you for creating this issue. We will troubleshoot it as soon as we can.


Info for maintainers

Triage this issue by using labels.

If information is missing, add a helpful comment and then I-issue-template label.

If the issue is a question, add the I-question label.

If the issue is valid but there is no time to troubleshoot it, consider adding the help wanted label.

If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C), add the applicable G-* label, and it will provide the correct link and auto-close the issue.

After troubleshooting the issue, please add the R-awaiting answer label.

Thank you!

github-actions[bot] avatar Mar 22 '24 17:03 github-actions[bot]

I may have a clue on the root cause I've a mechanism that prevent the node (through a custom SlotSelector) to run more that 2 sessions at a time except if I pass a specific capability so that, during a test, I can handle the case where a browser (e.g chrome) starts an other browser (e.g: Edge). This feature seems to break the distributor because it believes it has free slot whereas we prevent it from using them. Will see if a correction on my side solves the problem

bhecquet avatar Mar 22 '24 17:03 bhecquet

Hello, after investigating and correcting my implementation, I thinks there is still a weakness in LocalDistributor implementation Imagine you have

  • ah in my case, 1 CPU: LocalDistributor.sessionCreatorExecutor will only hold 3 threads
  • one node that has capacity => getNextAvailable may return as many session requests as slots in the node
  • some nodes (at least 2 in my case) that dramatically slow down during session creation

Then all creation threads will be stuck and new session requests will arrive at regular interval because one node can still accept sessions. The LocalDistributor.sessionCreatorExecutor will then contain session request that may have already expired

Increasing thread pool size could be a workaround but would there be a mean to check that the sessionCreatorExecutor has room for creating sessions ?

bhecquet avatar Mar 26 '24 07:03 bhecquet

I see that you have few resources, and when a Node is slow in creating a session, it affects the rest because not many threads are being processed.

You could increase the --max-threads and see if it helps.

diemol avatar Mar 26 '24 10:03 diemol

Hello @diemol ,

Thanks for your reply Correct me if I'm wrong, but --max-threads option read be BaseServerOptions.getMaxServerThreads() is not used in code (or I'm missing something), and it has no influence on LocalDistributor thread pool size. Anyway, I will try the --newsession-threadpool-size option

bhecquet avatar Mar 26 '24 14:03 bhecquet

You are correct; newsession-threadpool-size should be the one. I just mixed the flags.

diemol avatar Mar 26 '24 15:03 diemol

Hello,

increasing thread pool and also a filter on sessionCreatorExecutor queue, so that no new sessions requests are added if some are still waiting in queue. This seems to work pretty well. The most important thing is that if slowness come when creating sessions, we wait for queue to be empty to send more session requests to LocalDistributor

bhecquet avatar Mar 27 '24 10:03 bhecquet

Thank you for sharing that information.

Do you think we need to do something else or we can close this issue?

diemol avatar Mar 27 '24 13:03 diemol

As I said, I think something could be improved in handling new sessions request especially when hub or node are slow, namely to avoid creating sessions that are already timed out (for example, one could add a job like the one in LocalNewSessionQueue) that would remove session requests that are timed out, before trying to create them

But if I'm the only one to have this problem, then I can leave with my workaround that does this (and a bit more)

bhecquet avatar Mar 27 '24 15:03 bhecquet

But if a session request has timed out, the Distributor should not be able to retrieve it. Maybe I do not understand the scenario.

diemol avatar Mar 27 '24 17:03 diemol

In the normal process:

  • session request received and put in LocalNewSessionQueue through 'addToQueue'
  • Distributor takes it through 'getNextAvailable' so it's removed from LocalNewSessionQueue
  • Distributor processes it with 'handleNewSession' & 'newSession'
  • Session is returned to client

In case of high load (ten's of sessions received by the hub and few nodes to handle them) AND session-timeout (in my case 30 secs) set on NewSessionQueue:

  • session request received and put in LocalNewSessionQueue through 'addToQueue'
  • some time elapse, says 15 secs
  • Distributor takes the session request through 'getNextAvailable' so it's removed from LocalNewSessionQueue and 'timeoutSessions' won't handle it anymore until it returns to LocalNewSessionQueue
  • session request is not immediately handles because there are already sessions in Distributor queue. Session request can wait here during 15, 20 secs and even more, reaching the timeout. LocalNewSessionQueue replies to client that session has timedout, but session request is still in Distributor queue And so, when it's finally handled, it has effectively timed out

bhecquet avatar Mar 28 '24 07:03 bhecquet

Distributor takes the session request through 'getNextAvailable' so it's removed from LocalNewSessionQueue and 'timeoutSessions' won't handle it anymore until it returns to LocalNewSessionQueue

But the Distributor only takes more session requests if there is a stereotype (slot) available on the Node. That is why I am confused.

diemol avatar Mar 28 '24 11:03 diemol

I think you talk about this portion of code:

https://github.com/SeleniumHQ/selenium/blob/b7d831db8cfeac9dfbf441c623c6d4bf580f2cc5/java/src/org/openqa/selenium/grid/distributor/local/LocalDistributor.java#L780C8-L785C98

This only check if there is at least one slot available. If it's the case, then (as soon as I understand the code correctly), then all the slots are returned. Imagine you have a node that can handle:

  • 5 sessions in parallel
  • with 5 slots firefox
  • and 5 slots chrome 4 firefox slots are busy, so, NodeStatus.hasCapacity will reply true, and then LocalDistributor may accept 5 new chrome session requests and 5 new firefox session requests

bhecquet avatar Mar 28 '24 13:03 bhecquet

@bhecquet starting more sessions than slots is pervented later in this lines: https://github.com/SeleniumHQ/selenium/blob/b7d831db8cfeac9dfbf441c623c6d4bf580f2cc5/java/src/org/openqa/selenium/grid/distributor/local/LocalDistributor.java#L550-L565

joerg1985 avatar Apr 02 '24 22:04 joerg1985

@joerg1985, yes, sure, but the problem is in the delay starting session when nodes / distributor are slow, not the fact that there are more sessions created than expected

bhecquet avatar Apr 03 '24 06:04 bhecquet

@bhecquet my answer should say: rejecting these more new session requests than slots will happens pretty fast and should not add to much overhead in processing.

joerg1985 avatar Apr 03 '24 19:04 joerg1985

@joerg1985 , you're right, this step is quick. But in case a slot is free at this moment, a session will be created whereas it may already have expired, which takes time on overused LocalDistributor. And juste after that, the session will be removed because it has already timed out

bhecquet avatar Apr 04 '24 16:04 bhecquet

Hello folks, especially to our distinguished hardworking contributors,

I was executing 50 tests + 30 every minute for 5 minutes. The first 3 minutes was flawless. Each batch of tests in the first 3 minutes completed on average under 3 minutes without any issues. The 4th and 5th batch got stuck due to the distributor and router consuming too much resources and starving the rest of the components including themselves.

Netty eats up a lot of memory (HashMap$TreeNode.find and JsonOutput.getMethod) and never goes down until bounced while the java component of distributor consumes a lot of cpu for the threads. Local Distributor - Session Creation and Local Distributor - New Session Queue gets locked up while HttpClient-*-SelectorManager waits patiently, from there everything goes haywire.

Screenshot 2024-08-23 183520 In addition to timed out sessions not removed from queue, it also continually provisions more browser nodes even if Queue size: 0 for a long time.

Below is how it looks like at the end of the run. There were no more tests in the queue but it went up as high as 831 idle browser nodes Screenshot 2024-08-23 185336

I can reproduce this consistently on 4.23.0 (revision 77010cd) and would be willing to demo live.

Please provide your email if interested for a working session and see more details.

Thanks!

rx4476 avatar Aug 24 '24 00:08 rx4476

@rx4476 the leaking HttpClient-*-SelectorManager threads should have been fixed, but these fixes are not released yet. Commits 97d56d04e1b4ab4f8e527f8849b777c1e91d13f7, a5de3775db76dde54b9c6a94fbe3f2b816eacb80, ed3edee0ac4f09a5555aec7c7ab20609d8b394f2, 5bac4795ff6acc9f1ca1a6436aea0970ff98fb07 and 7612405e34d282992b26a22d7bee921753020026 are all related to this, so you could check the nightly snapshots.

joerg1985 avatar Aug 24 '24 21:08 joerg1985

@joerg1985 thank you for the update.

HttpClient-*-SelectorManager, and JdkHttpClient are more of a victim rather than the culprit. They have been waiting patiently from garbage collection and suspension to be over.

Garbage collection and memory allocation (Compared to LocalDistributor$NewSessionRunnable.run, SelectorManager allocation/utilization is negligible): Screenshot 2024-08-27 143924

Waiting threads: Screenshot 2024-08-27 142048

Here is where the thread gets locked: Screenshot 2024-08-27 154434

rx4476 avatar Aug 27 '24 20:08 rx4476