
Output Files That Exist Beget a FileNotFoundError

cupdike opened this issue 2 years ago · 10 comments

The training node tries to save a model file to Minio but reports a FileNotFoundError even though the file exists at the intended path.

To Reproduce

Configure the pipeline to mount the PVC of the "authoring" Jupyter Notebook at /home/jovyan in the node images. Create a node with Output Files set to model/model.pt and write a file to that location (specifically, /home/jovyan/model/model.pt, where the pipeline and node notebooks are in /home/jovyan).

[screenshot: node properties with Output Files set to model/model.pt]
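For illustration only (not taken from the issue), a minimal sketch of the training cell that writes the file; the Linear layer is a stand-in for the real model:

from pathlib import Path

import torch

# Stand-in for the real trained model
model = torch.nn.Linear(4, 2)

# Write the checkpoint to the location declared under Output Files
model_path = Path("/home/jovyan/model")
model_path.mkdir(parents=True, exist_ok=True)
torch.save(model.state_dict(), model_path / "model.pt")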

NOTE: I observe that the current directory in the node at runtime is NOT /home/jovyan as I would expect:

Incoming CWD: /workspace/jupyter-work-dir

However, I am forcing it to that so things run:

Adjusted CWD: /home/jovyan

If the upload to Minio still happens relative to that unexpected current directory, that might explain why the upload attempt gets a FileNotFoundError (and then the question becomes why the current directory is not /home/jovyan).
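A quick illustration of the suspected mismatch: the same relative path resolves to two different files depending on the process CWD.

import os

# A relative Output Files entry resolves against the process CWD
for cwd in ("/workspace/jupyter-work-dir", "/home/jovyan"):
    print(os.path.normpath(os.path.join(cwd, "model/model.pt")))
# /workspace/jupyter-work-dir/model/model.pt
# /home/jovyan/model/model.pt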

Error from "training" node:

Traceback (most recent call last):
  File "bootstrapper.py", line 697, in <module>
    main()
  File "bootstrapper.py", line 687, in main
    file_op.execute()
  File "bootstrapper.py", line 367, in execute
    raise ex
  File "bootstrapper.py", line 359, in execute
    self.process_outputs()
  File "bootstrapper.py", line 148, in process_outputs
    self.process_output_file(file.strip())
  File "bootstrapper.py", line 331, in process_output_file
    self.put_file_to_object_storage(matched_file)
  File "bootstrapper.py", line 309, in put_file_to_object_storage
    self.cos_client.fput_object(bucket_name=self.cos_bucket, object_name=object_to_upload, file_path=file_to_upload)
  File "/home/jovyan/.local/lib/python3.8/site-packages/minio/api.py", line 981, in fput_object
    file_size = os.stat(file_path).st_size
FileNotFoundError: [Errno 2] No such file or directory: 'model/model.pt'

Logs from the trainer image indicate that the file does exist at the intended location:

model_path: /home/jovyan/model

model_file_path = model_path / "model.pt"
torch.save(model.state_dict(), model_file_path)
print("#### Does model file exist?", model_file_path.exists())

'#### Does model file exist? True\n'

Expected behavior

The file should get saved to Minio.

Deployment information

Elyra 3.9.1
Kubeflow Pipelines 1.0.4

cupdike avatar Jun 23 '22 19:06 cupdike

Hi @cupdike! It appears that you have defined the data volume mount as a pipeline default property, which makes the mounted directory (and all output files your training job produces) implicitly available to other nodes that are executed after this node. You therefore don't need to declare any output files unless your goal is to have them also uploaded to the Minio bucket that you've configured in your runtime configuration. Can you please clarify what your intention is? Thank you!

ptitzler avatar Jun 23 '22 20:06 ptitzler

~~Also, could you let us know how Kubeflow Pipelines is being deployed? I'm assuming this is OpenShift with the CRI-O container runtime, given the appearance of the jupyter-work-dir directory name, which is the default working directory name created for this environment.~~

~~Prior to introducing improved volume mount support to Elyra, writing to the default home directory wasn't possible when using CRI-O without a mounted volume workspace, so we created a temporary emptyDir volume as a scratchpad. Rather than mounting directly to /home/jovyan, where files could already exist as part of the image build, we mounted the workspace to /opt/app-root/src/jupyter-work-dir. See elyra/kfp/operator.py#L150-L166.~~ Ignore if not using CRI-O.

The bootstrapper seems to default to the current directory when looking for files to upload via the Outputs parameter. The current directory resolves to wherever the image puts us at startup, which is also where we subsequently pull in the bootstrapper script. It'll take a bit more discussion if we want to alter that behavior and allow users to push artifacts from different locations.
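A simplified sketch of that behavior (not the actual bootstrapper code; endpoint, credentials, and bucket are placeholders): the Output Files entry is used verbatim as both the object name and the local file path, so the relative path resolves against the process CWD.

from minio import Minio

# Placeholder endpoint and credentials
client = Minio("minio-service:9000", access_key="...", secret_key="...", secure=False)

for output in ["model/model.pt"]:
    # file_path is relative, so os.stat() inside fput_object resolves it
    # against the process CWD (/workspace/jupyter-work-dir), not /home/jovyan
    client.fput_object(bucket_name="my-bucket", object_name=output, file_path=output)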

akchinSTC avatar Jun 23 '22 20:06 akchinSTC

@ptitzler Yes, the goal was to upload to Minio to simulate the situation where a different system component needs access to the finished model (e.g. for deployment as an API inference service).

@akchinSTC Not using CRI-O. It's a DeepOps distribution of Kubeflow, which is pretty close to plain vanilla AFAICT. I'm not necessarily asking for any changes; I just need to know how to get this working. The guidance to keep all your files at a path relative to the notebook doesn't seem to apply in this situation, as documented here:

> File dependencies must be located in the notebook/script file directory or a subdirectory of that location.

per https://elyra.readthedocs.io/en/v3.9.0/user_guide/best-practices-file-based-nodes.html#file-input

cupdike avatar Jun 24 '22 11:06 cupdike

@cupdike - So I think I've been able to recreate the problem you're having. Could you try copying any artifacts you want pushed to S3 into the original CWD (/workspace/jupyter-work-dir)? In the example above, this would probably be something like:

import subprocess

# Recreate the output path under the bootstrapper's working directory,
# then copy the model there so the relative Output Files path resolves
subprocess.run(["mkdir", "-p", "/workspace/jupyter-work-dir/model"])
subprocess.run(["cp", "/home/jovyan/model/model.pt", "/workspace/jupyter-work-dir/model/"])

akchinSTC avatar Jun 24 '22 17:06 akchinSTC

Yes, this works:

!mkdir -p /workspace/jupyter-work-dir/model && cp /home/jovyan/model/model.pt /workspace/jupyter-work-dir/model/

FWIW, it is very unintuitive to have the working directory be anything other than /home/jovyan

cupdike avatar Jun 24 '22 18:06 cupdike

It also works just to mount the PVC to that directory under Pipeline Properties IFF you put your project directly in /home/jovyan (cannot be in a subdir):

/workspace/jupyter-work-dir=my_pvc

It's "cleaner", but:

  1. Places a requirement on where your project files live (you cannot use a subdir as a project home).
  2. Does it mess up other functionality I'm not using?
  3. How do people know when they should do this, and which directory it should be mounted to?
  4. Cruft starts collecting in your directory (bootstrapper.py, requirements files).

cupdike avatar Jun 24 '22 18:06 cupdike

A potentially clever/opinionated way to handle this would be to apply a subPath to the PVC mount, using the home-relative location of the .pipeline file as the subPath: https://kubernetes.io/docs/concepts/storage/volumes/#using-subpath

It could pretty much just be a checkbox option that applies this submount to whatever the CWD is. That way, users could have multiple projects in different subdirectories, but whenever they run a pipeline, it would be as if their project files were in the CWD on the running pod (whatever that CWD might be). Maybe that's too prescriptive, but I figured I'd air it out... a rough sketch of the mount follows.
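For illustration, what that mount could look like using the Kubernetes Python client (the volume name, claim name, and subPath here are hypothetical):

from kubernetes.client import (
    V1PersistentVolumeClaimVolumeSource,
    V1Volume,
    V1VolumeMount,
)

# Mount the authoring PVC at the node's working directory, scoped via
# subPath to the directory that contains the .pipeline file
volume = V1Volume(
    name="authoring-pvc",  # hypothetical volume name
    persistent_volume_claim=V1PersistentVolumeClaimVolumeSource(claim_name="my_pvc"),
)
mount = V1VolumeMount(
    name="authoring-pvc",
    mount_path="/workspace/jupyter-work-dir",  # whatever the node's CWD is
    sub_path="ProjectB",  # home-relative location of the .pipeline file
)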

cupdike avatar Jun 24 '22 19:06 cupdike

Not sure your proposal is more intuitive, as it requires intimate knowledge of how the pipeline artifacts are stored in relation to each other. If an absolute mount path is used instead, there is no ambiguity. For example, a notebook/script could create a subdirectory in a mounted fs (e.g. /mnt/vol1/project1). The subpath (project1) could be passed to the notebooks/scripts as an environment variable, to keep things flexible.
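A minimal sketch of that pattern (the PROJECT_SUBPATH variable name is hypothetical):

import os
from pathlib import Path

# The mount point stays absolute and fixed; the project subdirectory is
# injected per pipeline run through an environment variable
project_dir = Path("/mnt/vol1") / os.environ.get("PROJECT_SUBPATH", "project1")
model_file = project_dir / "model" / "model.pt"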

ptitzler avatar Jun 24 '22 19:06 ptitzler

Not sure how much bandwidth I want to spend on this, but... if the user keeps all their pipeline files (notebooks, data directories, etc.) under their project directory, and a PVC from an existing Jupyter Server (as in Kubeflow) is used, they pretty much don't need to know anything. That is, they check the box, the PVC gets mounted into whatever the CWD ends up being for the node (even the unexpected /workspace/jupyter-work-dir), and the subdir mount makes it as if their notebooks and data directories were directly in the CWD... so the object upload would work as expected, without copying files around as @akchinSTC suggested. They wouldn't need to do a thing besides making sure all their project files live under their subdirectory. If they have a projectA directory and a projectB directory, they could run a .pipeline file from either one and it would just work. Perhaps it could be a special case for Kubeflow Jupyter Server PVC mounts.

The thing is, with the way things currently work, using a spurious CWD like /workspace/jupyter-work-dir, there is no clean way to get the object upload to work without copying files to a path under the CWD... which is not ideal (and it is not at all obvious why one gets a FileNotFoundError).

cupdike avatar Jun 24 '22 20:06 cupdike

Actually, I can get pretty close to what I described by moving the pipeline files up to the home directory and keeping the notebooks and data dirs in the accompanying subdirectories:

[screenshot: home directory containing the .pipeline files alongside project subdirectories]

Then just have this as the pipeline property for the Data Volume: /workspace/jupyter-work-dir=<kubeflow jupyter pvc>

Then clicking on either pipeline will run it without doing anything else (e.g. running the ProjectB pipeline).

Directories and files output relative to the ProjectB dir still live under there:

[screenshot: generated output under the ProjectB directory]

The problem is that all the intermediate files pollute the home dir because of the mount:

[screenshot: intermediate pipeline files cluttering the home directory]

It gets a bit ugly this way, but hopefully it demonstrates the concept.

I agree it would be better to just use an environment variable or something like that; it's just that there's currently no way to do that, given how the CWD is used for file uploads.

cupdike avatar Jun 24 '22 21:06 cupdike