k8s-spark-scheduler
Pods scheduled stuck in Pending state
I'm attempting to run spark-thriftserver using this scheduler extender. If you're not familiar, spark-thriftserver runs in client mode (local driver, remote executors). The thrift server exposes a JDBC connection which receives queries and turns these into spark jobs.
The command to run this looks like:
/opt/spark/sbin/start-thriftserver.sh \
--conf spark.master=k8s://https://my-EKS-server:443 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.namespace=spark \
--conf spark.kubernetes.container.image=my-image \
--conf spark.kubernetes.file.upload.path=file:///tmp \
--conf spark.app.name=sparkthriftserver \
--conf spark.kubernetes.executor.podTemplateFile=/path/to/executor.template \
--verbose
spark-defaults.conf looks like:
spark.sql.catalogImplementation hive
spark.kubernetes.allocation.batch.size 5
spark.shuffle.service.enabled true
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.executorIdleTimeout 30s
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.maxExecutors 50
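For context, the executor template referenced above is what routes executor pods to the extender's scheduler. A minimal sketch of such a template, assuming the scheduler name spark-scheduler registered by the stock extender.yaml (the path matches the --conf above; everything else is illustrative):

cat > /path/to/executor.template <<'EOF'
# pod template passed via spark.kubernetes.executor.podTemplateFile
apiVersion: v1
kind: Pod
spec:
  schedulerName: spark-scheduler   # hand executor pods to the scheduler deployed by extender.yaml
EOF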
So far, I've applied the extender.yaml file as-is, without any modifications. This instantiates two new pods in the spark namespace, both in the Running state, with names starting with "spark-scheduler-". Running kubectl describe pod XXX on them yields some troubling information:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 15m fargate-scheduler Successfully assigned spark/spark-scheduler-7bbb5bb979-fhktn to fargate-ip-XXX-XXX-XXX-XXX.ec2.internal
Normal Pulling 15m kubelet, fargate-ip-XXX-XXX-XXX-XXX.ec2.internal Pulling image "gcr.io/google_containers/hyperkube:v1.13.1"
Normal Pulled 15m kubelet, fargate-ip-XXX-XXX-XXX-XXX.ec2.internal Successfully pulled image "gcr.io/google_containers/hyperkube:v1.13.1"
Normal Created 14m kubelet, fargate-ip-XXX-XXX-XXX-XXX.ec2.internal Created container kube-scheduler
Normal Started 14m kubelet, fargate-ip-XXX-XXX-XXX-XXX.ec2.internal Started container kube-scheduler
Normal Pulling 14m kubelet, fargate-ip-XXX-XXX-XXX-XXX.ec2.internal Pulling image "palantirtechnologies/spark-scheduler:latest"
Normal Pulled 14m kubelet, fargate-ip-XXX-XXX-XXX-XXX.ec2.internal Successfully pulled image "palantirtechnologies/spark-scheduler:latest"
Warning Unhealthy 14m (x3 over 14m) kubelet, fargate-ip-XXX-XXX-XXX-XXX.ec2.internal Liveness probe failed: Get https://XXX.XXX.XXX.XXX:8484/spark-scheduler/status/liveness: dial tcp XXX.XXX.XXX.XXX:8484: connect: connection refused
Normal Killing 14m kubelet, fargate-ip-XXX-XXX-XXX-XXX.ec2.internal Container spark-scheduler-extender failed liveness probe, will be restarted
Normal Created 14m (x2 over 14m) kubelet, fargate-ip-XXX-XXX-XXX-XXX.ec2.internal Created container spark-scheduler-extender
Normal Pulled 14m kubelet, fargate-ip-XXX-XXX-XXX-XXX.ec2.internal Container image "palantirtechnologies/spark-scheduler:latest" already present on machine
Normal Started 14m (x2 over 14m) kubelet, fargate-ip-XXX-XXX-XXX-XXX.ec2.internal Started container spark-scheduler-extender
Warning Unhealthy 14m (x4 over 14m) kubelet, fargate-ip-XXX-XXX-XXX-XXX.ec2.internal Readiness probe failed: Get https://XXX.XXX.XXX.XXX:8484/spark-scheduler/status/readiness: dial tcp XXX.XXX.XXX.XXX:8484: connect: connection refused
Warning Unhealthy 14m kubelet, fargate-ip-XXX-XXX-XXX-XXX.ec2.internal Liveness probe failed: HTTP probe failed with statuscode: 503
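If it helps narrow down the probe failures, the extender container's logs and the probe endpoint can be checked directly. A minimal sketch, using the container names and probe path from the events above (pod name and IP are placeholders):

kubectl -n spark logs <spark-scheduler-pod> -c spark-scheduler-extender
kubectl -n spark logs <spark-scheduler-pod> -c spark-scheduler-extender --previous   # output of the container that failed the probe and was restarted
kubectl -n spark logs <spark-scheduler-pod> -c kube-scheduler
curl -k https://<scheduler-pod-ip>:8484/spark-scheduler/status/liveness              # -k because the probe is served over HTTPS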
When I attempt to run the driver above (which launches properly), the driver immediately requests a single executor pod at startup because spark.dynamicAllocation.minExecutors is set to 1. That pod remains indefinitely in the Pending state, and kubectl describe pod XXX seems to suggest that no nodes satisfy the pod's scheduling criteria:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3m49s (x37 over 14m) spark-scheduler 0/4 nodes are available: 4 Insufficient pods.
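For reference, a quick way to compare each node's allocatable pod count with what is already running on it (a sketch; the node name is copied from the events above):

kubectl get nodes -o custom-columns=NAME:.metadata.name,PODS:.status.allocatable.pods
kubectl describe node fargate-ip-XXX-XXX-XXX-XXX.ec2.internal | grep -A 10 'Non-terminated Pods'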
What I'm having trouble figuring out is:
- What exactly are the criteria that cause every node to be reported as insufficient? I am not making use of any instance-group labels, nor any custom labels, and all the nodes accept the spark namespace. Sorry to ask, but I am struggling to find the right steps to narrow down the issue.
- Do the "liveness" error messages above signify that the issue resides with an unhealthy scheduler? I can ssh into the two scheduler instances if needed, but I'm not sure which logs to take a closer look at once I open a shell.
If it helps, this cluster uses AWS Fargate for the compute resources behind Kubernetes, but based on what I know so far that shouldn't be an issue.
Unfortunately, the spark scheduler extender currently doesn't support launching client-mode applications on Kubernetes. It assumes that a driver will be launched in the cluster, which then proceeds to request executors.
That being said, I think your executor pods are failing to be scheduled before the extender is even consulted, as the message says 4 Insufficient pods, which is kube-scheduler's way of telling you that all 4 nodes in your cluster are over their pod count limit.
If you have fixed that by increasing your pod limit or killing existing pods, I would expect your pods to still be stuck in Pending, but with a message along the lines of failed to get resource reservations, since the extender will be looking for the space that the driver reserved, which never happens in client mode.
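The reservations mentioned here are stored as custom resources that get created when a cluster-mode driver is scheduled, so their absence can be checked directly. A sketch, assuming the CRD installed by extender.yaml (the exact resource name may differ; kubectl get crd will show it):

kubectl get crd | grep -i reservation        # find the exact resource name registered by the extender
kubectl -n spark get resourcereservations    # assumed name; in cluster mode there should be one per running application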
For your second question, I think it has to do with a network problem between the health probe and your container, because the message on the stuck pod shows that kube-scheduler did consider the pod and is therefore operational.
> Unfortunately, the spark scheduler extender currently doesn't support launching client-mode applications on Kubernetes. It assumes that a driver will be launched in the cluster, which then proceeds to request executors.
That nuance wasn't clear to me, but now that it is, I think I can work with this. Good to know, thanks.
> That being said, I think your executor pods are failing to be scheduled before the extender is even consulted, as the message says 4 Insufficient pods, which is kube-scheduler's way of telling you that all 4 nodes in your cluster are over their pod count limit. If you have fixed that by increasing your pod limit or killing existing pods, I would expect your pods to still be stuck in Pending, but with a message along the lines of failed to get resource reservations, since the extender will be looking for the space that the driver reserved, which never happens in client mode.
This is actually how AWS Fargate works as a resource negotiator: hardware is allocated on demand, always one node per pod. For example, say Spark requests resources for a new executor. That of course begets a request to Kubernetes for an executor pod, and with Fargate it also begets a request to allocate a new VM just in time for the lifetime of the executor, billed by the second. In 60-90 seconds (usually) Fargate returns a new VM with the Kubernetes tooling pre-installed and configured, sized to the request plus some extra RAM for the kubelet.
When I run kubectl get nodes, I can see the new node for the requested pod provisioned as expected, but there's something about this new node/VM that the scheduler extender rejects. I can go into more detail, or even give a step-by-step demonstration, if that helps. The key point I want to make is that there may be something about this cloud-based node-allocation behavior that doesn't play well with the scheduler extender, at least not without customization.
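To make that concrete, one way to watch the just-in-time node appear and capture exactly what the scheduler reports for the pending executor (a sketch; the pod name is whatever the driver requested):

kubectl get nodes -w                                                                  # watch the Fargate node get provisioned for the new pod
kubectl -n spark get events --field-selector involvedObject.name=<executor-pod> -w    # see the FailedScheduling reasons as they are emitted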
> For your second question, I think it has to do with a network problem between the health probe and your container, because the message on the stuck pod shows that kube-scheduler did consider the pod and is therefore operational.
I don't have a good response to this point. Within the VLAN containing the nodes there are no current restrictions on cross-node communication. It seems the "connection refused" errors come from requests where the client and server share the same IP. That might be an oversight I can track down by looking closer.
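One way to check that theory is to hit the probe endpoint from inside the extender container itself, so client and server are guaranteed to be the same host (a sketch, assuming curl is available in the image):

kubectl -n spark exec <spark-scheduler-pod> -c spark-scheduler-extender -- curl -sk https://localhost:8484/spark-scheduler/status/liveness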