enterprise_gateway
[Feature] Support for configure magic on Spark Python Kubernetes Kernels (WIP)
Problem Statement
With JEG running on a remote machine and handling the kernel lifecycle, notebook users can no longer change the kernel specs / properties locally to control the configuration with which the Spark kernel comes up. There are various use cases where users want to experiment with different Spark configurations to arrive at the final settings that best suit their workload. These configs may also vary from one notebook to another based on the workload the notebook is running. JEG is also used as a multi-tenant service, where each user may want to tweak the kernel for their own scenario. Thus, there is a need for users to be able to update the kernel / Spark properties at runtime from the notebook.
Feature Description
The changes proposed in this PR add support for the well-known magic %%configure -f {}
which allows notebook users to change the Spark properties at runtime without having to create or update any kernel spec file. This lets users change Spark driver and executor resources (such as cores and memory), enable / disable Spark configurations, etc.
Example: The snippet below can be copied into a notebook cell to update the various Spark properties associated with the current kernel.
%%configure -f
{
  "driverMemory": "3G",
  "driverCores": "2",
  "executorMemory": "3G",
  "executorCores": "2",
  "numExecutors": 5,
  "conf": {
    "spark.kubernetes.driver.label.test": "test-label"
  }
}
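For illustration only, here is a minimal sketch of how such a payload could be flattened into the spark-submit style options that the gateway passes along. The key-to-property mapping and the helper name payload_to_spark_opts are assumptions for this example, not the exact code in this PR.
import json

# Assumed mapping from %%configure keys to Spark properties (hypothetical).
PROPERTY_MAP = {
    "driverMemory": "spark.driver.memory",
    "driverCores": "spark.driver.cores",
    "executorMemory": "spark.executor.memory",
    "executorCores": "spark.executor.cores",
    "numExecutors": "spark.executor.instances",
}

def payload_to_spark_opts(payload: dict) -> str:
    """Flatten a %%configure payload into a '--conf k=v ...' option string."""
    confs = {}
    for key, value in payload.items():
        if key == "conf":
            confs.update(value)                 # raw spark confs pass through
        elif key in PROPERTY_MAP:
            confs[PROPERTY_MAP[key]] = value
    return " ".join(f"--conf {k}={v}" for k, v in confs.items())

payload = json.loads("""
{
  "driverMemory": "3G",
  "numExecutors": 5,
  "conf": {"spark.kubernetes.driver.label.test": "test-label"}
}
""")
print(payload_to_spark_opts(payload))
# --conf spark.driver.memory=3G --conf spark.executor.instances=5 --conf spark.kubernetes.driver.label.test=test-label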
Implementation Details
At a high level, the changes are:
- I have introduced a new API on JEG, POST api/configure/<kernel_id>, which accepts a payload similar to the create-kernel API. This API currently supports updating the ["KERNEL_EXTRA_SPARK_OPTS", "KERNEL_LAUNCH_TIMEOUT"] env variables (see the client-side sketch after this list).
- The above API restarts the same kernel with the updated configuration. This is done so that the kernel_id stays the same and the end user gets a smooth experience.
- Once the old kernel goes away and a replacement comes up, we also need to refresh the ZMQ sockets to establish the connection with the new kernel, so that existing active websocket connections from notebook / JupyterLab UI clients continue to work. Hooks have been introduced to handle this.
- Further, in order to complete the usual Jupyter kernel messaging handshake, we fire the missing ZMQ messages from JEG to the websocket clients. For example, to mark the current cell as completed we need to send the exec_reply message, and to mark the kernel idle we need to send the kernel status = idle message, etc. These messages are pre-generated on the kernel and sent to JEG while making the API call to refresh the kernel (see the message sketch after this list).
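A minimal sketch of what a client-side call to the new endpoint might look like, assuming the payload mirrors the create-kernel API's "env" dictionary and only the two supported env variables are honored. The gateway URL, the "env" wrapper key, and the example values are assumptions for illustration, not the final contract.
import requests

JEG_URL = "http://jeg-host:8888"   # assumed gateway address
kernel_id = "<kernel_id>"          # id of the running kernel to reconfigure

payload = {
    "env": {
        "KERNEL_EXTRA_SPARK_OPTS": "--conf spark.driver.memory=3G --conf spark.executor.instances=5",
        "KERNEL_LAUNCH_TIMEOUT": "120",
    }
}

# POST to the new configure endpoint; JEG restarts the kernel with the same kernel_id.
response = requests.post(f"{JEG_URL}/api/configure/{kernel_id}", json=payload)
response.raise_for_status()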
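And a hedged sketch of pre-generating the handshake messages (exec_reply and status=idle) that the kernel could hand to JEG for replay to the websocket clients once the replacement kernel is up. This uses jupyter_client's Session to build well-formed protocol messages; the actual message construction and wiring in the PR may differ.
from jupyter_client.session import Session

session = Session()

# Marks the %%configure cell as finished on the client side.
execute_reply = session.msg(
    "execute_reply",
    content={"status": "ok", "execution_count": 1},
)

# Returns the kernel to the idle state in the UI.
status_idle = session.msg("status", content={"execution_state": "idle"})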
I will update more details about the changes and add some diagrams.
Testing
- Basic sanity testing done.
Note
Opening this PR for some early feedback and discussion on the changes.
@kevin-bates :
I need help deciding the right terminology for the operation we are performing on the kernel using this new configure API:
- Do we call it "refreshing the kernel" or "re-configuring the kernel"?
- Do we change the API to api/refresh/<kernel_id> and call this a "kernel refresh" operation?
We need to use this term in both logs and response messages. Please give this some thought.
I guess "refresh" seems a little easier to understand than "reconfigure". Does this imply the magic name would change to %%refresh, and does that conflict with existing magics? I think having the terminology match the magic name would be helpful.
I would also like to see the endpoint be under api/kernels rather than a sibling to api/kernels. Do you agree? If not, could you please help me understand why not? Is adding an endpoint under api/kernels violating some kind of convention?
Hi @rahul26goyal - what is the status of this PR? It's been about 6 weeks since its last update and it seems there are a few things still to work out.