mlcube_examples

Config 2.0 unified storage description

xyhuang opened this issue · 1 comment

This proposes a unified storage description for config 2.0.

Today, MLCube relies on a simple "file path" approach to describe the inputs and outputs of its tasks. However, on many platforms, such as Kubernetes, a single file path is not sufficient: these platforms either have complex storage backends or use their own layer of storage abstraction that does not use "paths" to refer to locations in data storage. This proposal addresses the problem by providing a unified way of describing storage that covers both local file systems and more complex storage solutions.

A storage backend can be described in the platform section of the config, which is supplied by the user at run time. The storage description consists of two main parts: a name that will be used as a reference in the tasks' I/O paths, and a platform-specific spec that provides the details of the storage backend on the target platform, so that the runner can use it to find the right location of the data. We do not change the "path"-like descriptions of task inputs/outputs, in order to keep them simple; however, we do introduce a "variable"-like component as part of the path, so that this "variable" can serve as a reference to the corresponding storage backend while the rest of the path is treated as a path relative to that storage. The most straightforward example of such a "variable" is "$WORKSPACE", which is currently used to refer to a specific directory on the local file system. With this proposal, "$WORKSPACE", or any "$CUSTOM_NAME" defined by the user, can refer to an arbitrary storage backend as specified in the platform section.
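To illustrate the idea, a runner could split such a variable-prefixed path into a storage reference and a relative path roughly as follows. This is a minimal sketch: the function name and the shape of the storage map are hypothetical, not part of the proposal.

```python
import re

def resolve_path(value, storages):
    """Split a '$NAME/relative/path' value into (storage spec, relative path).

    Values without a leading '$NAME' variable are returned unchanged as
    plain paths with no associated storage backend.
    """
    m = re.match(r"^\$([A-Za-z_][A-Za-z0-9_]*)/?(.*)$", value)
    if not m:
        return None, value  # plain path, no storage variable
    name, rel = m.group(1), m.group(2)
    if name not in storages:
        raise KeyError(f"unknown storage backend: {name}")
    return storages[name], rel

# Hypothetical storage map derived from the platform section below.
storages = {
    "NFS_DATA": {"nfs": {"host": "127.0.0.1", "port": 2049,
                         "path": "some/nfs/path"}},
}
spec, rel = resolve_path("$NFS_DATA/data", storages)
# spec is the NFS backend description; rel is "data"
```

The runner would then combine `rel` with whatever notion of "root" the backend's spec defines.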

Since the detailed spec of the storage lives in the platform part, it can be decoupled from the shared MLCube config and appear only in the user's config. This also means that how the spec of a given storage backend is written should be agreed between a user and a runner, and is not relevant to the MLCube publisher. While we do not have to standardize those specs, we may provide "guidelines/examples" for popular platforms so that a convention can emerge among runner implementors.

The following is an example of how storage backends can be defined. Notice the specs in the platform section and how they are used in the tasks section. Notice also that if we name a storage "WORKSPACE", we can redirect our default workspace to the specified storage backend without changing the values in the task I/Os.

name: example-mlcube
platform:
  storage:
  - name: K8S_DATA
    spec:
      kubernetes:
        pvc_name: my-pvc
  - name: NFS_DATA
    spec:
      nfs:
        host: 127.0.0.1
        port: 2049
        path: some/nfs/path
container:
  image: mlcommons/mnist:0.0.1
  build_context: "mnist"
  build_file: "Dockerfile"
tasks:
  download:
    io:
    - {name: data_dir, type: directory, io: output, default: $NFS_DATA/data}
    - {name: log_dir, type: directory, io: output, default: $NFS_DATA/logs}
  train:
    io:
    - {name: data_dir, type: directory, io: input, default: $K8S_DATA/data}
    - {name: parameters_file, type: file, io: input, default: $K8S_DATA/parameters/default.parameters.yaml}
    - {name: log_dir, type: directory, io: output, default: $K8S_DATA/logs}
    - {name: model_dir, type: directory, io: output, default: $K8S_DATA/model}
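Since the spec format is agreed between user and runner rather than standardized, a runner would likely dispatch on the spec's platform-specific key. A small sketch of that dispatch, assuming the spec shapes from the example above (the function itself is hypothetical):

```python
def describe_backend(storage_entry):
    """Summarize a storage entry by dispatching on its spec's key.

    A real runner would mount or translate the backend here instead of
    returning a description string.
    """
    spec = storage_entry["spec"]
    if "kubernetes" in spec:
        return f"PVC {spec['kubernetes']['pvc_name']}"
    if "nfs" in spec:
        nfs = spec["nfs"]
        return f"NFS {nfs['host']}:{nfs['port']}/{nfs['path']}"
    if "local" in spec:
        return f"local dir {spec['local']['path']}"
    raise ValueError("unsupported storage spec")
```

A runner that does not recognize a spec key can fail early with a clear error, which is one reason to keep the spec under a single named key per backend type.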

xyhuang · Jul 23 '21 08:07

@xyhuang @bitfort @dfeddema @davidjurado

This really looks doable. A couple of comments:

  1. The more I think about the value format (e.g. ${STORAGE_NAME}/RELATIVE_PATH, as in $NFS_DATA/data), the more I am convinced that it may be confusing. Is NFS_DATA an internal variable, an environment variable, or some other identifier (as in our case)?
  2. Should the storage section be a list or a dictionary?

Regarding the first item: since these values are essentially identifiers for directories or files, can we partially adopt the URI approach? We could introduce a scheme named storage that refers to MLCube-supported storages listed in the user's (most likely) configuration file. Here is an example (I use a dictionary instead of a list just to show an alternative approach):

name: example-mlcube
platform:
  storage:
    K8S_DATA:
      spec:
        kubernetes:
          pvc_name: my-pvc
    NFS_DATA: 
      spec:
        nfs:
          host: 127.0.0.1
          port: 2049
          path: some/nfs/path
    workspace:
      spec:
        local:
          path: ${runtime.root}/workspace
    tmp:
      spec:
        local:
          path: ${oc.env:TMP}/mlcube/workspace/${name}
    home: 
      spec:
        local:
          path: ${oc.env:HOME}/.mlcube/workspace/${name}
container:
  image: mlcommons/mnist:0.0.1
  build_context: "mnist"
  build_file: "Dockerfile"
tasks:
  download:
    io:
    - {name: data_dir, type: directory, io: output, default: "storage:NFS_DATA/data"}
    - {name: log_dir, type: directory, io: output, default: "storage:NFS_DATA/logs"}
  train:
    io:
    - {name: data_dir, type: directory, io: input, default: "storage:K8S_DATA/data"}
    - {name: parameters_file, type: file, io: input, default: "storage:K8S_DATA/parameters/default.parameters.yaml"}
    - {name: log_dir, type: directory, io: output, default: "storage:K8S_DATA/logs"}
    - {name: model_dir, type: directory, io: output, default: "storage:K8S_DATA/model"}
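Values using the storage: scheme can be parsed with a plain prefix split, with no regular expression needed. A minimal sketch (the function name is hypothetical):

```python
def parse_storage_uri(value):
    """Parse a 'storage:NAME/relative/path' value into
    (storage_name, relative_path).

    Values that do not use the scheme are returned as (None, value),
    so ordinary paths keep working unchanged.
    """
    if not value.startswith("storage:"):
        return None, value
    rest = value[len("storage:"):]
    name, _, rel = rest.partition("/")
    return name, rel

# parse_storage_uri("storage:K8S_DATA/data") -> ("K8S_DATA", "data")
# parse_storage_uri("storage:home")          -> ("home", "")
```

An empty relative path, as in storage:home, would naturally mean the root of the named storage.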

We can also use this with the --workspace CLI argument. If users specify only relative paths (data, logs, model, ...), which are relative to the workspace root by default, then we can do something like:

mlcube run ... --workspace=storage:home

to keep data in user's home directory.

sergey-serebryakov · Aug 06 '21 06:08