
feat(lang): Support data and code integration in envd-server runner

Open aseaday opened this issue 3 years ago • 23 comments

Description

I studied the design document and https://github.com/tensorchord/envd/pull/303. In a team environment, there is no doubt that models, experiment logs, and datasets are also a necessary part of setting up a working env. There are some tools and projects aimed at solving this, such as DVC.

This issue is for discussion about:

  • Should we support storage for data sharing?
  • What is the best way to achieve it?

Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.

aseaday avatar Jun 30 '22 03:06 aseaday

Thanks for raising the discussions! There are some options for us:

  • Integrate with DVC. But DVC is not a de-facto standard now and it also has some problems.
  • Design the primitives in build.envd to declare expected locations of datasets/model/logs.
def build():
    config.model(location="xx", from="https://xx.savedmodel")

gaocegege avatar Jun 30 '22 04:06 gaocegege

Possible design:

runtime.docker.mount and other docker-specific funcs will be ignored in the kubernetes context. runtime.kubernetes.mount() should create/use the corresponding PV/PVC in the backend pod.

runtime.kubernetes.mount("")
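A minimal sketch of how the two namespaces could coexist in one build.envd under this design (the mount signatures here are assumptions from the discussion, not a committed API):

def build():
    base(os="ubuntu20.04", language="python3")
    # Honored by the docker runner, ignored under kubernetes.
    runtime.docker.mount(src="~/data", dst="/data")
    # Honored by the kubernetes runner: backed by a PV/PVC in the pod.
    runtime.kubernetes.mount(pvc="data", dst="/data")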

gaocegege avatar Sep 30 '22 00:09 gaocegege

Write a proposal this week

VoVAllen avatar Oct 07 '22 07:10 VoVAllen

Mount Code Proposal

There will be two ways to mount the code: directly clone the git repo, or use Syncthing to sync a folder inside the container with the user's local directory.

Reference: Syncthing in Okteto

Possible syntax design:


# Can declare globally
info.git_repo = "https://github.com/tensorchord/envd"


def build():
    ...
    

def build_k8s():
    # Default entrypoint for envd-server + k8s; if not found, the build function is used instead
    
    # Will create a PVC and clone the git repo into it
    runtime.kubernetes.mount_mode("git_repo")
    # Or
    # Will use Syncthing to sync the folder inside the container with the user's local folder
    runtime.kubernetes.mount_mode("Syncthing")
    ...

envd up under the envd-server context will use build_k8s as the default function; if it is not found, it will fall back to the build function.
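For illustration, the entrypoint selection could work like this (Python-ish pseudocode; the function and module names are assumptions):

def pick_entrypoint(module, context):
    # Under envd-server/k8s, prefer build_k8s; otherwise fall back to build.
    if context == "envd-server" and hasattr(module, "build_k8s"):
        return module.build_k8s
    return module.build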

Problem:

  • Is envd-server strongly bound to k8s? How many k8s details do we want to expose here? Do we need an extra abstraction at the envd-server level?

VoVAllen avatar Oct 09 '22 07:10 VoVAllen

Thanks for the proposal

Then do users need to write two build funcs for docker and k8s?

And how do we deal with the data part?

gaocegege avatar Oct 09 '22 08:10 gaocegege

The original build func should work in most cases, but the pod on k8s may not be provisioned exactly as the user wants. Since build funcs won't contain any k8s-specific logic, we can only have one default setting on k8s. Users may need to tune the details.

VoVAllen avatar Oct 09 '22 12:10 VoVAllen

How about using an if statement in this case:

if platform == "k8s":
    runtime.kubernetes.code(src="", dst="")

gaocegege avatar Oct 10 '22 02:10 gaocegege

It's viable. I think the core difference here is that the build_k8s design implies the build function is the primary way of using envd in a local docker environment. Therefore any changes needed to run on k8s won't touch anything that already exists; the user just writes a new function.

VoVAllen avatar Oct 11 '22 09:10 VoVAllen

For data, ideally the user first registers the dataset details with envd-server, then declares io.mount(src=dataset("imagenet"), dst="/data") in the build.envd. However, this is a bit complex; an easier way is to just expose the PVC details, such as io.mount(src=k8s.pvc("imagenet"), dst="/data"). The two variants are sketched below.
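Side by side, the two variants might look like this in build.envd (both signatures are proposals from this thread, not a shipped API):

def build():
    # Registered dataset: envd-server resolves "imagenet" to a backing volume.
    io.mount(src=dataset("imagenet"), dst="/data")
    # Direct PVC: the user points at a concrete k8s PersistentVolumeClaim.
    io.mount(src=k8s.pvc("imagenet"), dst="/data")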

VoVAllen avatar Oct 11 '22 09:10 VoVAllen

@kemingy @aseaday @zwpaper Do you have any questions about it?

gaocegege avatar Oct 12 '22 06:10 gaocegege

For the initial version we can start with info.git_repo = "https://github.com/tensorchord/envd" and clone it into the k8s pod. Syncthing will also be implemented later.

VoVAllen avatar Oct 12 '22 08:10 VoVAllen

I'm not sure why we need to declare a global info.git_repo object and then use a string like "git_repo" to refer to it.

For code sync, should we choose git clone or http?

How about using an if statement in this case:

if platform == "k8s":
    runtime.kubernetes.code(src="", dst="")

This can save some boilerplate code. I'm wondering if we can hide the context keywords like docker or k8s. We can detect the context and choose the right implementation; if it's not implemented for a given context, just panic.

The difference between contexts can be handled by some configurations. For example:

if envd.context.type == "kubernetes":
    envd.mount_host = pvc("user-data")

So the user can run the code in different contexts.
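A sketch of that detect-and-dispatch idea (all names here are hypothetical):

def mount_code(src, dst):
    ctx = envd.context.type  # detected by envd, not specified by the user
    if ctx == "kubernetes":
        runtime.kubernetes.code(src=src, dst=dst)
    elif ctx == "docker":
        runtime.docker.code(src=src, dst=dst)
    else:
        # Starlark's fail() aborts evaluation, matching "just panic".
        fail("code mount is not implemented for context: " + ctx)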

If we introduce build_k8s, that means we will need more default build functions for different types of contexts in the future.

kemingy avatar Oct 12 '22 08:10 kemingy

I am not sure the conditional statement is a good choice. We should still insist on declaration over implementation. Or at least we'd better use an override class or something like polymorphism to solve the problem.

aseaday avatar Oct 12 '22 17:10 aseaday

I don't think separating the docker and k8s build is a good idea.

Let me try to explain from a user perspective.

Suppose I am a researcher: I would not want to copy and edit my build() into a build_k8s() only because I am switching my env from docker to k8s.

As we have contexts, we know which target env users are building for, and PVC is the de facto standard storage solution in k8s (even for host path, we should use something like a local provisioner to implement the PV), so maybe we can set it by default and offer an option to change it.

The philosophy behind envd is that we encapsulate the complexity and make things easier.

So personally, I would vote for a solution to simplify things, like:

  1. we keep all build scripts only in build()
  2. we detect and switch between docker and k8s automatically
  3. we use some path for docker by default, ~/.env/data/<NAME>/code for example, and a PVC for k8s
  4. we offer options to change the docker mount point or storage type, storage parameters, etc.
  5. the default case should satisfy most users; researchers do not, and should not need to, care about mount points or PVCs/PVs

for example:

def build():
    runtime.code(src="", dst="")

def build():
    runtime.code(src="git@github.com:envd.git", dst="/code",
                 k8s_volume="host_path",
                 docker_volume="cache")

zwpaper avatar Oct 12 '22 17:10 zwpaper

@zwpaper wrote:

I don't think separating the docker and k8s build is a good idea. […]

Makes sense. It is promising.

gaocegege avatar Oct 13 '22 00:10 gaocegege

Any opinions about it? @VoVAllen

gaocegege avatar Oct 13 '22 04:10 gaocegege

@zwpaper wrote:

I don't think separating the docker and k8s build is a good idea. […]

We'd better provide a mount config base class to support polymorphic configuration, such as:

# mount_config.py
class MyMount:
    def kubernetes(self):
        self.k8s.runtime.use("pvc")
        code_src = self.code.src
        ...

    def host(self):
        ...

We could provide default logic for how to mount the code and the data, but it'd better be a mount configuration class for the different cases. Too many options in the function would make it difficult to maintain.

 runtime.code(src="git@github.com:envd.git", dst="/code", configuration=MyMount)

aseaday avatar Oct 13 '22 06:10 aseaday

Thanks @aseaday and @zwpaper. My core consideration here is also to keep the build function unified for both k8s and docker. The key design goal is to support ad hoc configuration on k8s. I think @aseaday's suggestion of using separate configurations to declare the implementation is a good design. Starlark doesn't support classes, but we can use a dict directly for simplicity.

How about

runtime.code(src="git@github.com:envd.git", dst="/code", configuration={"k8s": {"pvc": "code"}})
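Extending the same shape to cover both backends (the keys are illustrative, combining @zwpaper's docker example above):

runtime.code(
    src="git@github.com:envd.git",
    dst="/code",
    configuration={
        "k8s": {"pvc": "code"},
        "docker": {"volume": "cache"},
    },
)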

VoVAllen avatar Oct 13 '22 06:10 VoVAllen

After team discussion:

  • We think it's better to use the same build function and provide a way for the user to identify the current context.
  • Use a function such as config.info(repo="https://github.com/tensorchord/envd") to declare git repo information. The user can then declare runtime.mount(src=k8s.pvc('code'), dst=project_root()) for advanced settings; see the sketch below.
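A minimal build.envd following that decision (config.info's repo parameter and project_root() are the names proposed in this thread, not a shipped API):

def build():
    config.info(repo="https://github.com/tensorchord/envd")
    # Advanced: explicitly pick the PVC backing the code mount on k8s.
    runtime.mount(src=k8s.pvc("code"), dst=project_root())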

Related issue: https://github.com/tensorchord/envd/issues/1054

VoVAllen avatar Oct 20 '22 04:10 VoVAllen

Generally, I think PVC is a good abstraction that fits all data sources (NFS, OSS, Lustre, and so on) in enterprise-level usage. Therefore what we need to do here is just decide which PVC to mount into the pod on k8s.

  • PVC as dataset
    • The user specifies runtime.mount(src=data.dataset('daily_sales'), dst="/data"). This is stored as an image label and won't affect the image build process. When launching the environment, the backend can decide how to mount the dataset.
    • When launching the environment on envd-server, it will send a request to /k8s/data with {"user": "allen", "name": "daily_sales", "project_name": "sales_prediction"}, get a response with the PVC information, and insert it into the pod YAML (see the sketch after this list).
    • We can also provide mechanisms to sync datasets between different backends if needed.
  • HTTP dataset - The user can also specify data.dataset("http://xxx/mnist.tar.gz"), and we can download it when launching the pod.
  • SQL access (advanced usage) - If the user wants to query information from a database, we can also inject the related credential environment variables if needed.
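A sketch of that launch-time exchange, assuming envd-server exposes the /k8s/data endpoint described above (the response fields are illustrative):

import requests

def resolve_dataset_pvc(server, user, name, project):
    # Ask envd-server which PVC backs this dataset for this user/project.
    resp = requests.post(
        server + "/k8s/data",
        json={"user": user, "name": name, "project_name": project},
    )
    resp.raise_for_status()
    # e.g. {"pvc": "daily-sales", "mount_path": "/data"}
    return resp.json()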

VoVAllen avatar Nov 09 '22 08:11 VoVAllen

What's the difference between data.envd and data.dataset? Does data.dataset work without envd-server?

kemingy avatar Nov 09 '22 08:11 kemingy

@kemingy It's the same; I just randomly picked a name. data.dataset will only add a label to the image, and the runner will decide how to handle it. Therefore the local_docker runner can also implement this. But there's no guarantee that the datasets at different backends are identical.

VoVAllen avatar Nov 09 '22 09:11 VoVAllen

Previous discussion: https://github.com/tensorchord/envd/pull/650

VoVAllen avatar Dec 22 '22 14:12 VoVAllen