
PSA: You should use Keda for selenium autoscaling in k8s

Open lenardchristopher opened this issue 3 years ago • 10 comments

First off, thanks to sahajamit for writing this initially. 👏🏻

The preferred solution is to use Keda. It's a CNCF project, which means ongoing support, whereas this repo looks unmaintained.

https://keda.sh/docs/2.5/scalers/selenium-grid-scaler/
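
For reference, the scaler is configured with one ScaledObject per browser type. A minimal sketch based on those docs; the hub URL and the chrome node Deployment name are placeholders for whatever your setup uses:

      apiVersion: keda.sh/v1alpha1
      kind: ScaledObject
      metadata:
        name: selenium-grid-chrome-scaledobject
        labels:
          deploymentName: selenium-chrome-node   # placeholder deployment name
      spec:
        maxReplicaCount: 8
        scaleTargetRef:
          name: selenium-chrome-node             # placeholder deployment name
        triggers:
          - type: selenium-grid
            metadata:
              url: 'http://selenium-hub:4444/graphql'   # placeholder hub URL
              browserName: 'chrome'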

lenardchristopher avatar Jan 20 '22 19:01 lenardchristopher

@lenardchristopher Any idea if this actually solves the issue of scaling down using k8s' HPA with selenium grid?

The big issue with using HPA with selenium grid (explained here in more detail) is that when a scale-down happens, pods are killed at random. Sometimes the pods removed are ones that are still running tests, which ruins the test run.

The purpose of this code was to scale up based on load but then scale down ONLY when utilization hit 0 (no tests running or in queue).

I am currently attempting a POC with keda. Scale-up behavior seems great, but scale-down so far suffers from the same issue: there is a random chance that a killed pod will be one that is in use.

Wolfe1 avatar Feb 03 '22 19:02 Wolfe1

@Wolfe1 I looked at the Keda selenium scaler code and it appears to only be using the session queue size with no consideration for sessions. It seems like a good idea to implement your session management logic into the scaler. I'm happy to implement it myself. However, I'm an operator of selenium -- not a user. Can you please share any relevant knowledge you have about consuming the selenium API? What is the general logic for determining if a session is running on a worker?

Also, I couldn't find anyone else complaining about session management in the Keda repo. Does this surprise you or is this an edge case?

lenardchristopher avatar Feb 07 '22 14:02 lenardchristopher

@lenardchristopher I want to preface this by saying that I am no expert here and am still learning about HPA and Keda, so I may be missing an obvious solution 😁

With that said, it really does surprise me that this has not been raised, as I don't think it is an edge case at all.

Keda currently IS taking both sessions in use and session queue size into consideration. The issue is that when Keda tells the HPA to scale down, it kills pods at random, with no consideration of whether a pod is idle or not.

What we have been working with that Keda could hopefully also do:

  • When we scale down we want to only do so if the current session count (for our browser type) is 0.
    • This way we can be confident that if we scale down to minimum scale, we won't terminate any running sessions of that browser type.
  • Is it possible to have Keda only send the signal to scale down if there are no active sessions of the set browser type?
    • This could be a metadata flag like "cautiousScaleDown" that defaults to false to preserve backwards compatibility
    • Perhaps I am missing something with the scaleDown behavior. Is it possible to have the HPA only scale down when doing so would result in minimum scale?
      • AKA ignore the metrics from Keda until Keda says that we have no load, so we can scale down to minimum replicas (the scale-down behavior settings Keda can pass through to the HPA are sketched right after this list).
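
On that last point, Keda can forward HPA behavior settings through the ScaledObject, which tunes how quickly and how aggressively scale-down happens, but it still cannot choose which pods get removed. A sketch, assuming Keda 2.x on an HPA v2-capable cluster; this fragment sits inside the ScaledObject spec:

      spec:
        advanced:
          horizontalPodAutoscalerConfig:
            behavior:
              scaleDown:
                stabilizationWindowSeconds: 600   # wait before acting on lower metrics
                policies:
                  - type: Pods
                    value: 1
                    periodSeconds: 120            # remove at most one pod every two minutes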

In a Perfect World this is what we are looking for:

  • When we scale down, we would only scale down pods that were not in use.
    • Not in use being that selenium does not have an active session on that node.
  • As far as I know, this cannot be done (at least easily) with the HPA alone. In order to get it working with the HPA, we would need to somehow keep the podDeletionCost of the pod updated depending on whether it is currently running a test (see the sketch after this list).
    • This is also a fairly new feature and other solutions are being built around it:
      • https://github.com/kubernetes/kubernetes/issues/107598
  • All in all, I think this is outside of what KEDA would be doing right now, as the situation here is still evolving.
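
Purely as an illustration of the podDeletionCost idea: the annotation is controller.kubernetes.io/pod-deletion-cost (beta as of Kubernetes 1.22), and keeping it current would mean running something like the commands below from a hypothetical hook whenever a session starts or ends on a node. Deciding when and where to run them is the unsolved part:

      # hypothetical hook when a session starts: make this pod expensive to delete
      kubectl annotate pod "$POD_NAME" --overwrite controller.kubernetes.io/pod-deletion-cost=100

      # hypothetical hook when the session ends: make it cheap again, so the ReplicaSet prefers removing it
      kubectl annotate pod "$POD_NAME" --overwrite controller.kubernetes.io/pod-deletion-cost=0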

Other things I am trying:

  • In my search to get Keda working (as I do like that solution much better and it would support more than just chrome) I am attempting to make the browser pods more resilient.
    • Selenium grid allows pods to be "drained", which makes the node stop taking session requests, finish its current session, and then disconnect.
    • To use that, I added a preStop hook to my chrome pods' yaml (along with a longer terminationGracePeriodSeconds):

              preStop:
                exec:
                  command:
                  - /bin/sh
                  - -c
                  - curl --request POST 'localhost:5555/se/grid/node/drain' --header 'X-REGISTRATION-SECRET;' && tail --pid=$(pgrep -f 'node --bind-host false --config /opt/selenium/config.toml') -f /dev/null; sleep 60s

      • The tail is added to wait for the node process to finish before continuing with the pod termination.
      
  • This way when the HPA would scale down the deployment, if it hit a node in use it would drain before terminating, resolving my issue.
    • Unfortunately I have not had good luck here. I can see my tests keep running on the session while it's draining, but at some point the session gets killed too early. Still trying to figure out if this is an issue on my end or a bug in the draining functionality.

Other Information:

  • Selenium grid graphql docs: https://www.selenium.dev/documentation/grid/advanced_features/graphql_support/ (a rough example query is sketched below)
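
To answer the earlier question about how to tell whether sessions are running: the grid exposes those numbers over this graphql endpoint. A rough example against a hub at localhost:4444; field names are taken from the docs above and may differ slightly between grid versions (older grids expose the queue under sessionsInfo instead):

      curl --request POST 'http://localhost:4444/graphql' \
        --header 'Content-Type: application/json' \
        --data '{"query": "{ grid { maxSession, sessionCount, sessionQueueSize } }"}'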

Thank you for your time.

Wolfe1 avatar Feb 07 '22 16:02 Wolfe1

Hi @Wolfe1.

Interesting that there is a node drain option through the grid. I assume preStop did not work?

If it doesn't, I don't know of anything offhand that would tell the HPA which pods to drop specifically. There might be something, but I'll need to research it further.

lenardchristopher avatar Feb 10 '22 21:02 lenardchristopher

I found this comment about trying jobs instead of deployments. I will probably try this out tomorrow. https://giters.com/SeleniumHQ/selenium/issues/9845#issuecomment-1016996290
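
If the jobs route ends up going through Keda itself, the rough shape would be a ScaledJob instead of a ScaledObject. A sketch only; the names and image tag are placeholders, the node's hub-registration env vars are omitted, and I have not verified the selenium-grid trigger against ScaledJob myself:

      apiVersion: keda.sh/v1alpha1
      kind: ScaledJob
      metadata:
        name: selenium-chrome-node-scaledjob        # placeholder name
      spec:
        pollingInterval: 30
        maxReplicaCount: 8
        jobTargetRef:
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: chrome
                  image: selenium/node-chrome:4.1.2   # placeholder tag; hub registration env vars omitted
        triggers:
          - type: selenium-grid
            metadata:
              url: 'http://selenium-hub:4444/graphql'   # placeholder hub URL
              browserName: 'chrome'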

lenardchristopher avatar Feb 10 '22 21:02 lenardchristopher

Hey @lenardchristopher,

preStop has gotten me close (after a lot of trial and error), to the point that I would say it's working, but I am still hitting some technical hurdles.

My current preStop command (I currently have my terminationGracePeriodSeconds set to something like 3600 or 7200; a sketch of where both of these sit in the deployment yaml follows the list below):

command: ["/bin/sh", "-c", "tail --pid=$(pgrep -f '[o]pt/selenium/chromedriver') -f /dev/null; curl --request POST 'localhost:5555/se/grid/node/drain' --header 'X-REGISTRATION-SECRET;'; tail --pid=$(pgrep -f '[n]ode --bind-host false --config /opt/selenium/config.toml') -f /dev/null"]

  • When we go to stop this will:
    • Check and follow the chromedriver process (only runs when a session is running)
    • Send the command to drain the node
    • Check and follow the node process, wait for it to finish before full pod termination.
    • Possibly a short sleep at the end in case there is instability (not in the script right now)
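
For context, this is roughly where those pieces sit in the chrome node deployment yaml (container name and image tag are placeholders; the preStop command is the one above):

      spec:
        template:
          spec:
            terminationGracePeriodSeconds: 3600
            containers:
              - name: selenium-chrome-node          # placeholder name
                image: selenium/node-chrome:4.1.2   # placeholder tag
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh", "-c", "..."]   # the drain/tail command shown above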

The current issue is that we still randomly lose connection on a few builds; it seems to recover and tests can pass, but the logs fill up with errors. I think this has to do with the way we are scaling the cluster up and down for more pods, though, and not the preStop. Going to try using virtual nodes for the pods today; just waiting on help from our devops resource to enable it in our cluster.

Wolfe1 avatar Feb 11 '22 13:02 Wolfe1

@lenardchristopher Well, Azure container instances are... not great (slow to spin up, inconsistent), so I went another direction: testing out putting the selenium browser pods on their own nodepool, so as to prevent the scaling up and down of the cluster from causing the tests to periodically lose connection to the grid.

So far so good: using keda for autoscaling together with that preStop command (and a longer terminationGracePeriodSeconds) has been very stable.

Wolfe1 avatar Feb 16 '22 16:02 Wolfe1

@lenardchristopher

Finally got around to documenting how I got this working with KEDA in case it helps anyone: https://www.linkedin.com/pulse/scaling-kubernetes-selenium-grid-keda-brandon-wolfe/

Wolfe1 avatar May 31 '22 15:05 Wolfe1

I tried the same configuration, but the keda HPA is giving me FailedComputeMetricsReplicas. It is not able to get the metrics and scale.

Any way to resolve this? https://github.com/DeepKandey/POMFramework
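
Generic places to look when the HPA cannot pull metrics from Keda (names assume a default Keda install in the keda namespace; adjust to your setup):

      # confirm Keda's external metrics API is registered and Available
      kubectl get apiservice v1beta1.external.metrics.k8s.io

      # check the ScaledObject and the HPA that Keda created for it
      kubectl describe scaledobject <your-scaledobject> -n <your-namespace>
      kubectl describe hpa -n <your-namespace>

      # look for scaler errors (e.g. an unreachable graphql url) in the operator logs
      kubectl logs -n keda deploy/keda-operator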

DeepKandey avatar Mar 11 '23 10:03 DeepKandey

One thing I don't see mentioned enough about using KEDA is that if you use it in your cluster, you need to use it for all your scaling needs. When I attempted to implement KEDA for our SE4 grid it worked great, but it interfered with all previously configured traditional HPAs in our cluster since it became the metrics API. This caused all kinds of problems and I had to remove it. Is there any solution to this that doesn't require forcing everyone to use KEDA where it's not necessary (they only need to scale on traditional HPA metrics)? Sorry if this is the wrong place to ask, but it's really hard to find a solution for this that doesn't involve a lot of change for the rest of our infrastructure, and it isn't talked about much elsewhere.

I'm interested in this solution because it doesn't involve KEDA and wouldn't disrupt our current HPA efforts, and I'm wondering if anyone has tried this with an SE4 grid.

phebus avatar Feb 29 '24 19:02 phebus