Exchange files between kernel and notebook server

Open michzimny opened this issue 6 years ago • 9 comments

I have JupyterHub + EG in k8s, like in the blog post. My use cases:

  1. I write a code snippet in my notebook that gets executed in a kernel container. The execution is time-consuming and produces some output that my code snippet writes to a local file, i.e., a file in the kernel container. I'd like to have this output file pulled back to the notebook server so I can put it in my persistent space, i.e., not lose it when the kernel stops.

  2. I have some input files for a code snippet I want to execute in Jupyter. The input files live locally, either on my computer or in the persistent space of my notebook server. I'd like to provide these input files to the kernel so my code snippet can access them during execution.

So, is there a way to exchange such additional input/output files between a kernel container and an nb2kg container? If not, is there any other proposed solution for such cases?

michzimny avatar May 15 '19 22:05 michzimny

Yeah, this is the classic issue with remote kernels. Currently, you'll need to mount (via NFS or some other mechanism) the user's home directory (or wherever the files are located) into the kernel pod. This is accomplished by modifying the kernel-pod yaml/template (jinja support was recently added in this area) so that the appropriate mounting takes place. Since the kernelspecs are, by default, embedded in the EG image, you'll want to mount them into EG so that you can manipulate the appropriate kernel-pod template.
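
To make that last point concrete, here's a minimal sketch of how externally managed kernelspecs might be mounted into the EG deployment so their kernel-pod templates can be edited without rebuilding the image. The NFS server address and image tag are placeholders, and the mount path assumes the default kernelspec location in the EG image:

    # Fragment of an EG deployment spec; assumes kernelspecs are exported via NFS
    spec:
      containers:
      - name: enterprise-gateway
        image: elyra/enterprise-gateway:VERSION  # hypothetical tag
        volumeMounts:
        # Overlay the image's embedded kernelspecs with the external copy
        - name: kernelspecs
          mountPath: /usr/local/share/jupyter/kernels
      volumes:
      - name: kernelspecs
        nfs:
          server: <internal-ip-of-nfs-server>
          path: /usr/local/share/jupyter/kernels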

Also, I'd recommend enabling EG_MIRROR_WORKING_DIRS when starting EG, and setting KERNEL_WORKING_DIR in the JupyterHub spawner configuration to the user's working directory. That value then gets propagated to EG and used during the launch of the kernel-pod, such that the kernel-pod's working directory is switched to it. Coupled with the previously mentioned mounting instructions, you should then have a shared environment set up.
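
As a rough sketch (the deployment details and the /users/jdoe path are assumptions for illustration), the EG side of that configuration might look like:

    # Fragment of the EG deployment spec: opt in to mirroring working directories
    env:
    - name: EG_MIRROR_WORKING_DIRS
      value: "True"
    # KERNEL_WORKING_DIR itself is not set here; it is supplied per user on the
    # client side (e.g., the spawner/notebook environment setting
    # KERNEL_WORKING_DIR=/users/jdoe) and forwarded to EG with the kernel
    # start request.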

In these cases, it's probably a good idea to use KERNEL_NAMESPACE, although Hub's lack of per-user namespaces is a bit of a thorn since, at this time, all user kernels would wind up in the same namespace. So, in the meantime, not setting KERNEL_NAMESPACE and letting EG create a namespace for each launched kernel is probably your best bet for ensuring better granularity of resources, etc.

Lots of stuff here, and I don't really have time at the moment for more details, so please consider this a point-in-the-right-direction kind of thing.

Like HA, we consider this topic a priority for making the user experience easier. You can also search the issues (open and closed) and pull requests, in addition to the docs, for more information. Thanks.

kevin-bates avatar May 16 '19 05:05 kevin-bates

@kevin-bates

I also have a similar use case. I have multiple users in Notebook, and my EG is running on a different system. I need to restrict one user's access to another user's data. Is there any way I can provide permission-based access to user folders to restrict access between users?

ArindamHalder7 avatar May 24 '19 10:05 ArindamHalder7

For YARN configurations this can be addressed via Kerberos. For container configs, user-specific mounts can be used.

Which environment are you referring to?

cc: @lresende

kevin-bates avatar May 24 '19 14:05 kevin-bates

Thanks @kevin-bates

I have the following environment.

  1. EG version 1.2, Python 3.7, and Spark 2.3.1 (non-Kerberized Spark).
  2. Notebook will run as a container for multiple users. For now I am running it without Docker, but I will move it to Docker.
  3. User-specific notebooks have some local files from which I want to create a pandas DataFrame and then load it into Spark for further processing.

How can I specify the user specific mount?

ArindamHalder7 avatar May 24 '19 17:05 ArindamHalder7

@ArindamHalder7 your issue seems a little different from the original use case described by @michzimny, particularly because he is using Hub + Kubernetes. Do you mind creating a new issue and describing more details about your scenario?

lresende avatar May 25 '19 19:05 lresende

Hi @lresende

Yes, I have created a new issue: #676.

ArindamHalder7 avatar May 27 '19 08:05 ArindamHalder7

Hello! I am having the same issue. Could anyone elaborate on a concrete example of how to use the EG_MIRROR_WORKING_DIRS variable, along with a sample config, to achieve proper communication between the hub pod and the kernel pods? That would be useful for many of us who are not that experienced with handling k8s servers. @lresende @kevin-bates

Thanks in advance! :)

fjferdiez avatar Apr 25 '20 04:04 fjferdiez

(I apologize for the delayed response.) This probably isn't going to be the answer you were hoping for, since you'll still need to apply the appropriate mount logic in the kernel-pod.yaml.j2 file.

When EG_MIRROR_WORKING_DIRS is enabled on EG, it merely instructs EG to pass along the value of KERNEL_WORKING_DIR to the pod launch machinery, which winds up invoking the kernel-pod.yaml file after the template values are applied. All this does is instruct the container to use KERNEL_WORKING_DIR as the value for workingDir once the kernel is launched.

Let's assume the desired working directory is the user's home directory (e.g., /users/jdoe). Since you're using JupyterHub, let's also assume that directory has been mounted for use by the notebook server that JupyterHub launches. In addition, you should probably map JUPYTERHUB_USER (not sure if that's the proper variable name) to KERNEL_USERNAME, so that both have a value of jdoe, while KERNEL_WORKING_DIR has a value of /users/jdoe.

The kernel-pod.yaml.j2 file has sections for specifying unconditional mounts (those that span users), but it can also be modified to include conditional mounts that vary per user. Since the working directories in this case are mounted for every user (even though a portion of the path varies), they would be considered unconditional, although they could instead be expressed via the conditional lists KERNEL_VOLUME_MOUNTS and KERNEL_VOLUMES. As a result, you would need to massage the appropriate kernel-pod.yaml.j2 file to mount the user's directory in the pod.

I haven't tried this, but if you take the NFS example from the docs, you might end up with a kernel-pod.yaml.j2 file that looks something like this...

    image: "{{ kernel_image }}"
    name: "{{ kernel_pod_name }}"
    {% if kernel_working_dir is defined %}
    workingDir: "{{ kernel_working_dir }}"
    {% endif %}
    volumeMounts:
# Define any "unconditional" mounts here, followed by "conditional" mounts that vary per client
    - name: user_working_dir
      mountPath: "/users/{{ kernel_username }}"
    {% if kernel_volume_mounts is defined %}
      {% for volume_mount in kernel_volume_mounts %}
    - {{ volume_mount }}
      {% endfor %}
    {% endif %}
  volumes:
# Define any "unconditional" volumes here, followed by "conditional" volumes that vary per client
  - name: user_working_dir
    nfs:
      server: <internal-ip-of-nfs-server>
      path: "/users/{{ kernel_username }}"
  {% if kernel_volumes is defined %}
    {% for volume in kernel_volumes %}
  - {{ volume }}
    {% endfor %}
  {% endif %}
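
For concreteness, here is what those sections might render to for the jdoe example above, assuming no conditional mounts were supplied (the image name, pod name, and NFS address below are hypothetical placeholders):

    # Hypothetical rendered result for user "jdoe" after template substitution
    spec:
      containers:
      - image: "elyra/kernel-py:VERSION"   # hypothetical kernel image
        name: "jdoe-mykernel"              # hypothetical pod name
        workingDir: "/users/jdoe"
        volumeMounts:
        - name: user-working-dir
          mountPath: "/users/jdoe"
      volumes:
      - name: user-working-dir
        nfs:
          server: 10.20.30.40              # placeholder NFS server IP
          path: "/users/jdoe"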

If you wanted to take the conditional approach, you might find @esevan's PR #629 helpful.
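
Purely as a sketch of the conditional approach (the exact mechanism for supplying these values is best taken from that PR and the docs, and the entries below are hypothetical), each item in kernel_volume_mounts and kernel_volumes would need to render as a valid list element in the template's conditional loops, e.g.:

    # Hypothetical per-user entries; each renders as one item in the
    # template's {% for %} loops above
    kernel_volume_mounts:
    - {name: user-data, mountPath: /data}
    kernel_volumes:
    - {name: user-data, nfs: {server: <internal-ip-of-nfs-server>, path: /exports/jdoe}}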

cc: @esevan @lresende for comments, tips.

kevin-bates avatar Apr 28 '20 04:04 kevin-bates

@kevin-bates thanks! This is a very good starting point; I am going to give it a try.

fjferdiez avatar May 02 '20 08:05 fjferdiez