
Dynamic Selenium 4 grid on Kubernetes

gazal-k opened this issue 3 years ago • 83 comments

🚀 Feature Proposal

Just like the dynamic Selenium 4 grid using Docker, having a similar k8s "pod factory" (or something along those lines) would be nice.

https://github.com/zalando/zalenium does that. Perhaps some of that can be ported to grid 4

gazal-k avatar Sep 19 '21 12:09 gazal-k

We are happy to discuss approaches, what do you have in mind, @gazal-k?

diemol avatar Sep 21 '21 14:09 diemol

Sorry, I'm not really familiar with the selenium grid codebase. I imagine this: https://github.com/SeleniumHQ/selenium/blob/trunk/java/src/org/openqa/selenium/grid/node/docker/DockerSessionFactory.java has some of the logic to dynamically create browser nodes to join the grid. It would be nice to have something similar to create k8s Pods, so that the Kubernetes Selenium 4 grid scales based on test demand as opposed to creating a static number of browser nodes.

Again, sorry that I don't have something more solid to contribute.

gazal-k avatar Sep 22 '21 00:09 gazal-k

I have attempted to build something similar for Kubernetes with Selenium Grid3. More details here: https://link.medium.com/QQMCXLqQMjb

sahajamit avatar Sep 23 '21 13:09 sahajamit

I have some thoughts about how the Kubernetes support could be implemented. I remember having a look at the Grid 4 codebase in December 2018 and I wrote up my thoughts in this ticket over in Zalenium when someone asked if we planned to support Grid 4: https://github.com/zalando/zalenium/issues/1028#issuecomment-522230092 This was largely based on my ideas on how to add High-Availability support to Zalenium for Kubernetes: https://github.com/zalando/zalenium/issues/484#issue-305907701 from early 2018.

So assuming the grid architecture is still the same as it was in 2018, i.e. router, sessionMap and distributor, I think my original ideas are still valid.

The crux of it was to implement the sessionMap as annotations (metadata) on a Kubernetes pod, so that Selenium Grid didn't need to maintain the session state, which means that you could scale it and make it highly available much more easily.

So it means you could run multiple copies of the router, and you probably just want one distributor as you'd get into race conditions when creating new selenium pods. The sessionMap would end up just being a shared module/library that the router and distributor used to talk to the Kubernetes API server.

pearj avatar Sep 23 '21 13:09 pearj
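A minimal sketch of how the annotation idea above could look on a node pod. The annotation keys here are made up purely for illustration and are not part of any existing Selenium implementation; the idea is that the distributor would write them when a session starts and the router would look sessions up via the Kubernetes API instead of a separate sessionMap store:

    apiVersion: v1
    kind: Pod
    metadata:
      name: selenium-node-firefox-abc12
      labels:
        app: selenium-node
      annotations:
        # Hypothetical keys: written by the distributor when a session starts,
        # cleared when it ends, queried by the router instead of a sessionMap.
        selenium.dev/session-id: "5f9a2c7e-1b2d-4c3e-9f0a-0123456789ab"
        selenium.dev/session-uri: "http://10.42.0.17:5555"
    spec:
      containers:
        - name: firefox-node
          image: selenium/node-firefox:4.1.1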

If we wanted a more pure k8s solution, if there were metrics exposed around how many selenium sessions are in queue, or even how long they've been waiting, maybe even rate of queue processing, it would be possible to configure a horizontal pod autoscaler (HPA) around the node deployment itself to target a given rate of message processing.

LukeIGS avatar Sep 28 '21 15:09 LukeIGS
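As a rough sketch of that idea, assuming some metrics adapter (e.g. Prometheus Adapter) already exposes the session queue size as an external metric, an HPA for a node Deployment could look roughly like this (the metric and Deployment names are placeholders, not something Selenium ships):

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: selenium-node-chrome
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: selenium-node-chrome
      minReplicas: 1
      maxReplicas: 20
      metrics:
        - type: External
          external:
            metric:
              name: selenium_grid_session_queue_size   # assumed name, exposed by the metrics adapter
            target:
              type: AverageValue
              averageValue: "1"   # aim for roughly one queued session per node replica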

There is https://keda.sh/docs/2.4/scalers/selenium-grid-scaler/ which can autoscale nodes, and it works fine - the problem is with tearing down a node. Since it doesn't keep track of which node is working, it could kill a test in progress, and it seems the Chrome Node doesn't handle that gracefully.

Warxcell avatar Dec 02 '21 19:12 Warxcell
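For reference, a minimal ScaledObject using that scaler looks roughly like this (the hub URL and Deployment name are placeholders; see the KEDA docs linked above for the full list of trigger options):

    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: selenium-node-chrome-scaler
    spec:
      scaleTargetRef:
        name: selenium-node-chrome        # the node Deployment to scale
      minReplicaCount: 0
      maxReplicaCount: 10
      triggers:
        - type: selenium-grid
          metadata:
            url: 'http://selenium-hub:4444/graphql'   # Grid 4 GraphQL endpoint
            browserName: 'chrome'

As noted above, scaling up works well with this; the open problem is scale-down, since Kubernetes has no way of knowing which replica is still running a test.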

I tried another approach by implementing an application which intercepts the docker-engine calls from the selenium node-docker component, translates those calls to k8s calls and then calls the Kubernetes API. It works properly, creating and stopping browser nodes depending on the calls from node-docker. But this has a major problem, because node-docker doesn't support concurrency: it can only create a single browser node, run the test, destroy it and then move on to the next. (I will be creating a separate issue for the node-docker concurrency problem.)

From what I noticed, node-docker binds those browser nodes to itself and exposes them as sessions of the node-docker to the distributor. So all the distributor sees is the node-docker and not the browser nodes. I think this approach is not appropriate for concurrent execution, as I feel it is a single point of failure that would end all the sessions routed through the node-docker.

Therefore I think the KEDA Selenium-Grid-AutoScaler is a much better approach.

MissakaI avatar Jan 16 '22 19:01 MissakaI

The crux of it was to implement the sessionMap as annotations (metadata) on a Kubernetes pod, so that Selenium Grid didn't need to maintain the session state, which means that you could scale it and make it highly available much more easily.

There is a slight issue with this, as it will make Grid 4 dependent on Kubernetes. It would result in two different implementations of the Grid: one specific to K8s and one not dependent on Kubernetes. I think a much better approach is to make the Grid HA by other means, like sharing the current state across all instances of a particular grid component type.

MissakaI avatar Jan 16 '22 19:01 MissakaI

There is a slight issue with this, as it will make Grid 4 dependent on Kubernetes.

It's already dependent on Docker. Perhaps there should be some middleware for different environments.

quarckster avatar Jan 16 '22 19:01 quarckster

@MissakaI I have tested the KEDA Selenium-Grid-AutoScaler and it scales up as many nodes as you need based on the session queue, which is OK. The problem is with the video part, because it doesn't work in Kubernetes. I have managed to deploy the video container in the same pod, but the video file is not saved until the video container is stopped gracefully, and you also cannot set the name of the video for every test; it records all the time until it is closed.

qalinn avatar Jan 17 '22 08:01 qalinn

There is a slight issue with this, as it will make Grid 4 dependent on Kubernetes.

It's already dependent on Docker. Perhaps there should be some middleware for different environments.

The selenium repository is currently dependent on Ruby, Python, dotnet, and quite a few other things that it probably shouldn't be. There's certainly an argument for a lot of that to be split out into separate modules, but that's probably a conversation for another issue.

LukeIGS avatar Jan 18 '22 14:01 LukeIGS

We had a note in the KEDA standup meeting to see if we can help with Selenium & video. Is the person who added it part of this thread? If so, please open a discussion on how we can help: https://github.com/kedacore/keda/discussions/new

tomkerkhove avatar Jan 18 '22 16:01 tomkerkhove

Will do, issue in question is https://github.com/SeleniumHQ/selenium/issues/10018

These two are pretty intertwined.

LukeIGS avatar Jan 18 '22 16:01 LukeIGS

@tomkerkhove I am the one who added the note for your standup meeting. Please also see this issue: #10018

qalinn avatar Jan 19 '22 10:01 qalinn

Tracking this in https://github.com/kedacore/keda/discussions/2494

tomkerkhove avatar Jan 19 '22 12:01 tomkerkhove

As was mentioned in kedacore/keda#2494 you can either use KEDA to scale a deployment or jobs. I'm thinking that scaling jobs might be more fitting. But you then need to make sure that the container exits when it's done with a session. On the other hand you don't have the problem with Kubernetes trying to delete a pod that is still executing a test.

To make a node exit after a session is done you need to add a property to the node section of config.toml:

implementation = "org.openqa.selenium.grid.node.k8s.OneShotNode"

With the official docker images this isn't enough, since supervisord would still be running. So for that case you would need to add a supervisord event listener that shuts down supervisord together with its subprocesses.

One good thing with this approach is that, combined with the video feature, you get one video per session. Regarding graceful shutdown: in the dynamic grid code any video container is stopped before the node/browser container. So I guess the video file gets corrupted if Xvfb exits before ffmpeg is done saving the file. The event listener described above should therefore shut down the supervisord in the video container before shutting down the one in its own container.

For shutting down supervisord, you can use the unix_http_server and supervisorctl features of supervisord. That works between containers in the pod as well.

I've also been thinking about how to have the video file uploaded to S3 (or similar) automatically. The tricky part is supplying the pod with the URL to upload the file to. I have some ideas, but those will have to wait until the basic solution is implemented.

msvticket avatar Jan 20 '22 00:01 msvticket
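A rough sketch of the job-based variant described above, assuming the node container exits on its own once its session is done (e.g. via OneShotNode plus the supervisord event listener); the names and limits are placeholders:

    apiVersion: keda.sh/v1alpha1
    kind: ScaledJob
    metadata:
      name: selenium-node-firefox-job
    spec:
      maxReplicaCount: 10
      jobTargetRef:
        template:
          spec:
            restartPolicy: Never
            containers:
              - name: firefox-node
                image: selenium/node-firefox:4.1.1   # assumed to exit after one session
                env:
                  - name: SE_EVENT_BUS_HOST
                    value: selenium-hub
                  - name: SE_EVENT_BUS_PUBLISH_PORT
                    value: "4442"
                  - name: SE_EVENT_BUS_SUBSCRIBE_PORT
                    value: "4443"
      triggers:
        - type: selenium-grid
          metadata:
            url: 'http://selenium-hub:4444/graphql'
            browserName: 'firefox'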

I have managed to deploy the video container in the same pod, but the video file is not saved until the video container is stopped gracefully, and you also cannot set the name of the video for every test; it records all the time until it is closed.

I think this case should be followed up in the thread dedicated to it, which was mentioned by @LukeIGS:

Will do, issue in question is #10018

MissakaI avatar Jan 20 '22 03:01 MissakaI

Also, we need a way to implement liveness and readiness probes, because I ran into a few instances where the selenium process was killed but the pod continued to run, which means Kubernetes never terminates the crashed pod and reinstates a new one.

MissakaI avatar Jan 20 '22 03:01 MissakaI
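A minimal sketch of what such probes could look like on the node container, assuming the default node port 5555 and the node's /status endpoint (the thresholds are arbitrary examples):

    containers:
      - name: chrome-node
        image: selenium/node-chrome:4.1.1
        ports:
          - containerPort: 5555
        readinessProbe:
          httpGet:
            path: /status        # node status endpoint
            port: 5555
          initialDelaySeconds: 10
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /status
            port: 5555
          initialDelaySeconds: 30
          periodSeconds: 10
          failureThreshold: 3

Note that this only detects that the node process has stopped responding; it says nothing about whether the session running on it is healthy.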

As was mentioned in kedacore/keda#2494 you can either use KEDA to scale a deployment or jobs. I'm thinking that scaling jobs might be more fitting. But you then need to make sure that the container exits when it's done with a session. On the other hand you don't have the problem with Kubernetes trying to delete a pod that is still executing a test.

Thank you for this. I have been using deployments and thought of raising an issue with KEDA to add the annotation controller.kubernetes.io/pod-deletion-cost: -999, which tells the replication controller to delete the pod with the lowest cost first and leave the others.

To make a node exit after a session is done you need to add a property to the node section of config.toml:

implementation = "org.openqa.selenium.grid.node.k8s.OneShotNode"

Also, can you point me to where this is covered in the Selenium documentation, if it was documented at all?

MissakaI avatar Jan 20 '22 03:01 MissakaI
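For illustration, the annotation mentioned above is plain pod metadata, so something (the node itself, the distributor, or a sidecar) would have to update it when a session starts and ends; a pod marked like this is the preferred victim when the ReplicaSet scales down:

    apiVersion: v1
    kind: Pod
    metadata:
      name: selenium-node-chrome-idle-1
      annotations:
        # Pods with a lower deletion cost are removed first on scale-down;
        # an idle node could carry a low value and a busy one a high value.
        controller.kubernetes.io/pod-deletion-cost: "-999"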

As was mentioned in kedacore/keda#2494 you can either use KEDA to scale a deployment or jobs. I'm thinking that scaling jobs might be more fitting. But you then need to make sure that the container exits when it's done with a session. On the other hand you don't have the problem with Kubernetes trying to delete a pod that is still executing a test.

Thank you for this. I have been using deployments and thought of raising an issue with KEDA to add the annotation controller.kubernetes.io/pod-deletion-cost: -999, which tells the replication controller to delete the pod with the lowest cost first and leave the others.

I don't see how that would help. You could put that cost in the manifest to begin with. But in any case you end up with having to remove/update that annotation when the test is done, and KEDA doesn't know when that is.

There is a recent proposal for Kubernetes to let the pod inform Kubernetes, through a probe, which pods to delete: kubernetes/kubernetes#107598. Until something like that is implemented, either the node itself or maybe the distributor would need to update the annotation.

To make a node exit after a session is done you need to add a property to the node section of config.toml: implementation = "org.openqa.selenium.grid.node.k8s.OneShotNode"

Also, can you point me to where this is covered in the Selenium documentation, if it was documented at all?

I haven't found anything about it in the documentation. I stumbled upon org.openqa.selenium.grid.node.k8s.OneShotNode when I was looking in the selenium code. It then took a while for me to find out how to make use of the class. That's implemented here: https://github.com/SeleniumHQ/selenium/blob/2decee49816aa611ce7bbad4e52fd1b29629b1df/java/src/org/openqa/selenium/grid/node/config/NodeOptions.java#L148

On the other hand I haven't tested it, so who knows if OneShotNode still works...

This is where it should be documented: https://www.selenium.dev/documentation/grid/configuration/toml_options/

msvticket avatar Jan 20 '22 07:01 msvticket

I don't see how that would help. You could put that cost in the manifest to begin with. But in any case you end up with having to remove/update that annotation when the test is done, and KEDA doesn't know when that is.

I was intending to either write an application that monitors the test sessions along with the respective pods, or write a custom KEDA scaler that does what I mentioned previously.

MissakaI avatar Jan 20 '22 08:01 MissakaI

There is an issue about shutting down the node container when the node server has exited: SeleniumHQ/docker-selenium#1435

msvticket avatar Jan 20 '22 09:01 msvticket

On the other hand I haven't tested it, so who knows if OneShotNode still works...

It seems that even though the code is available in the repo, adding it to config.toml causes a ClassNotFoundException. Extracting selenium-server-4.1.1.jar revealed that the k8s package is missing entirely.

java.lang.reflect.InvocationTargetException
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.openqa.selenium.grid.Bootstrap.runMain(Bootstrap.java:77)
        at org.openqa.selenium.grid.Bootstrap.main(Bootstrap.java:70)
Caused by: org.openqa.selenium.grid.config.ConfigException: java.lang.ClassNotFoundException: org.openqa.selenium.grid.node.k8s.OneShotNode
        at org.openqa.selenium.grid.config.MemoizedConfig.getClass(MemoizedConfig.java:115)
        at org.openqa.selenium.grid.node.config.NodeOptions.getNode(NodeOptions.java:148)
        at org.openqa.selenium.grid.node.httpd.NodeServer.createHandlers(NodeServer.java:127)
        at org.openqa.selenium.grid.node.httpd.NodeServer.asServer(NodeServer.java:183)
        at org.openqa.selenium.grid.node.httpd.NodeServer.execute(NodeServer.java:230)
        at org.openqa.selenium.grid.TemplateGridCommand.lambda$configure$4(TemplateGridCommand.java:129)
        at org.openqa.selenium.grid.Main.launch(Main.java:83)
        at org.openqa.selenium.grid.Main.go(Main.java:57)
        at org.openqa.selenium.grid.Main.main(Main.java:42)
        ... 6 more
Caused by: java.lang.ClassNotFoundException: org.openqa.selenium.grid.node.k8s.OneShotNode
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
        at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
        at java.base/java.lang.Class.forName0(Native Method)
        at java.base/java.lang.Class.forName(Class.java:398)
        at org.openqa.selenium.grid.config.ClassCreation.callCreateMethod(ClassCreation.java:35)
        at org.openqa.selenium.grid.config.MemoizedConfig.lambda$getClass$4(MemoizedConfig.java:100)
        at java.base/java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1737)
        at org.openqa.selenium.grid.config.MemoizedConfig.getClass(MemoizedConfig.java:95)
        ... 14 more

The Docker image that I used was selenium/node-firefox.

MissakaI avatar Jan 21 '22 10:01 MissakaI

Well, the selenium project is a bit confusing. Apparently the selenium build system excludes the package org.openqa.selenium.grid.node.k8s from selenium-server.jar. Here I found bazel build configurations for building docker images: https://github.com/SeleniumHQ/selenium/tree/trunk/deploys/docker

The firefox_node and chrome_node images are declared there to include a layer (called one-shot) that contains a library with that class. But neither these images nor the library seems to be published publicly anywhere.

In https://github.com/SeleniumHQ/selenium/tree/trunk/deploys/k8s you can see how that library is utilized: https://github.com/SeleniumHQ/selenium/blob/451fc381325437942bc953e3f79facee9f2a3c22/deploys/k8s/firefox-node.yaml#L19-L44

It seems like the idea is that you check out the code and then build and deploy these images and k8s manifests to your local infrastructure.

msvticket avatar Jan 21 '22 11:01 msvticket

Thank you all for sharing your thoughts and offering paths to move forward. I will reply to the comments below.

diemol avatar Jan 21 '22 14:01 diemol

There is https://keda.sh/docs/2.4/scalers/selenium-grid-scaler/ which can autoscale nodes, and it works fine - the problem is with tearing down a node. Since it doesn't keep track of which node is working, it could kill a test in progress, and it seems the Chrome Node doesn't handle that gracefully.

Something new in Grid 4 is the "Drain Node" feature. With it, you can start draining the Node so no new sessions are accepted, and when the last session is completed, the Node shuts down. It gets tricky when the Node is inside a Docker container, because supervisord does not exit, which is the point of https://github.com/SeleniumHQ/docker-selenium/issues/1435. I have not had the time to implement it, but I am hoping someone can contribute to it.

diemol avatar Jan 21 '22 14:01 diemol
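One way to wire draining into Kubernetes is a preStop hook on the node container, so that a scale-down first drains the node and then waits for the running session to finish. This is only a sketch: it assumes curl is available in the image, no registration secret is configured, the default node port 5555, the documented /se/grid/node/drain endpoint, and a terminationGracePeriodSeconds long enough for the longest test:

    lifecycle:
      preStop:
        exec:
          command:
            - sh
            - -c
            - |
              # Ask the node to stop accepting new sessions (empty registration secret).
              curl -s -X POST http://localhost:5555/se/grid/node/drain \
                   -H 'X-REGISTRATION-SECRET;' || true
              # Then wait until the node server stops responding, i.e. it has shut
              # down after the last session finished.
              while curl -sf http://localhost:5555/status > /dev/null; do sleep 5; done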

As was mentioned in kedacore/keda#2494 you can either use KEDA to scale a deployment or jobs. I'm thinking that scaling jobs might be more fitting. But you then need to make sure that the container exits when it's done with a session. On the other hand you don't have the problem with Kubernetes trying to delete a pod that is still executing a test.

To make a node exit after a session is done you need to add a property to the node section of config.toml:

implementation = "org.openqa.selenium.grid.node.k8s.OneShotNode"

With the official docker images this isn't enough, since supervisord would still be running. So for that case you would need to add a supervisord event listener that shuts down supervisord together with its subprocesses.

One good thing with this approach is that, combined with the video feature, you get one video per session. Regarding graceful shutdown: in the dynamic grid code any video container is stopped before the node/browser container. So I guess the video file gets corrupted if Xvfb exits before ffmpeg is done saving the file. The event listener described above should therefore shut down the supervisord in the video container before shutting down the one in its own container.

For shutting down supervisord, you can use the unix_http_server and supervisorctl features of supervisord. That works between containers in the pod as well.

I've also been thinking about how to have the video file uploaded to S3 (or similar) automatically. The tricky part is supplying the pod with the URL to upload the file to. I have some ideas, but those will have to wait until the basic solution is implemented.

I believe this comment has most of the information needed for this issue.

To shut down the Node, we can either go with Draining or try to use the OneShotNode. The OneShotNode was an experiment and has not been tested thoroughly. Either way, if we end up using OneShotNode, we can see how to include it in the server jar.

Probably the things that need to be tackled are:

  • Exiting the container when the Node shuts down; for that, the supervisord approach suggested above seems like a good path.
  • Stopping the video container gracefully, either right before or right after the session ends.

diemol avatar Jan 21 '22 15:01 diemol

Is there any activity on this issue? What's the recommendation? Is the KEDA bug still there?

Bjego avatar Feb 17 '22 08:02 Bjego

@diemol how about your suggestion here: https://github.com/SeleniumHQ/selenium/issues/7243 to document how to scale the nodes via a Kubernetes CLI? I guess people could spin up their own sidecar/cronjob to check the endpoints and scale nodes. JS/TS has a pretty good Kubernetes client library, which should be good enough to scale pods.

Bjego avatar Feb 17 '22 08:02 Bjego

The containers now exit when the Node shuts down. We still need to add a flag to the Node so it exits after X sessions.

diemol avatar Feb 17 '22 22:02 diemol