amazon-asdi
EKS example is not working
I have tried many times now to implement the EKS example, but it is not working. After following all the steps and signing in with the username and password, JupyterHub just gets stuck. The console log shows 0 and then, after 5 minutes, 100, but it fails.
@edonD thank you for bringing this to my attention, and apologies that you wasted time having to debug this.
I went through the solution and was able to replicate the issue you saw.
The Jupyter notebook environment could not start because its pod could not be scheduled onto a node in the EKS cluster.
I found a thread elsewhere in which other users ran into the same issue:
In that thread, multiple users confirmed that changing this setting solved their issue:
scheduling.userScheduler.enabled = false
To implement this back in the EKS example project here, one can add a block to the 'daskhub.yaml' file so that the beginning of the file looks like this:
jupyterhub:
  scheduling:
    userScheduler:
      enabled: false
I have also gone through the solution and updated the software packages to the latest versions of EKS, eksctl, etc. Pull requests are pending; once they are merged, all of these changes will be in the solution.
@ethanfah I just merged all the PRs in, so latest should be in place.
@edonD if you get a chance to rerun with the latest and confirm it's working, just leave a comment here or go ahead and close the issue. Thanks!
Wow, impressed with the fast feedback from you guys. I will try it out and let you know. Thank you nonetheless!
The configuration is smooth and works perfectly. The only problem I can see so far is that it somehow doesn't create the gateway cluster. I am using the cmip6_zarr.ipynb example. It gets stuck at this cell:
cluster = GatewayCluster(worker_cores=0.8, worker_memory=3.3)
cluster.scale(32)
client = cluster.get_client()
cluster
I tried to initiate it with fewer workers, but the result is the same. I guess this could have many causes, so if it works on your setup I can spend some more time trying to debug it.
If you got as far as logging into the notebook, then at least the minimum number of EC2 instances were provisioned correctly. The step you are having trouble with requires additional EC2 instances to be created. The way it works is that dask worker pods are "scheduled", and because the cluster does not have enough room to fit all of the scheduled pods, the cluster autoscaler steps in and creates additional nodes for those pods to run on.
So the first thing to check is whether all of those dask worker pods are scheduled or not. If they are scheduled, then the next thing to check is whether the cluster autoscaler was installed correctly, such that it is trying to create more EC2 instances. The process of instantiating more EC2 instances can take a while, sometimes 5-10 minutes.
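As a rough illustration of that first check, here is a minimal Python sketch that reads the JSON produced by `kubectl get pods -o json` and lists the pods that are still waiting for a node. The function name `unschedulable_pods`, the pod name, and the sample message are made up for this example and are not part of the solution. It relies only on the standard Kubernetes pod status fields (`phase` and the `PodScheduled` condition).

```python
import json

def unschedulable_pods(pod_list):
    """Return (pod name, scheduler message) pairs for pods that are
    Pending because the scheduler could not place them on a node."""
    stuck = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") != "Pending":
            continue
        for cond in status.get("conditions", []):
            if (cond.get("type") == "PodScheduled"
                    and cond.get("status") == "False"
                    and cond.get("reason") == "Unschedulable"):
                stuck.append((pod["metadata"]["name"], cond.get("message", "")))
    return stuck

# Hypothetical sample shaped like `kubectl get pods -o json` output.
SAMPLE = json.loads("""
{
  "items": [
    {"metadata": {"name": "dask-worker-example-1"},
     "status": {"phase": "Pending",
                "conditions": [{"type": "PodScheduled", "status": "False",
                                "reason": "Unschedulable",
                                "message": "0/2 nodes are available: 2 Insufficient cpu."}]}},
    {"metadata": {"name": "dask-worker-example-2"},
     "status": {"phase": "Running",
                "conditions": [{"type": "PodScheduled", "status": "True"}]}}
  ]
}
""")

for name, message in unschedulable_pods(SAMPLE):
    print(name, "->", message)
```

To run it against a real cluster, save the output of `kubectl get pods -o json` (adding `-n` with whichever namespace your deployment uses) to a file and pass the result of `json.load` on that file to the function. If pods are listed here and no new nodes appear after several minutes, the cluster autoscaler is the next place to look.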
There is one other thing to keep in mind. The default configuration for this solution uses Spot Instances for the worker nodes, and Spot Instances are not always available. So in some cases your EC2 instances may fail to start simply because there are not enough instances available in your region/AZ, and the solution would work if you tried it the next day.
Let me know what you find, hoping to help you get this working.