feat(lang): Support data and code integration in envd-server runner
Description
I studied the design document and https://github.com/tensorchord/envd/pull/303. In a team environment, there is no doubt that models, experiment logs, and datasets are also a necessary part of setting up a work env. There are some tools and projects aimed at solving this, such as DVC.
This issue is for discussion about:
- Should we support storage for data sharing?
- What is the best way to achieve it?
Message from the maintainers:
Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.
Thanks for raising the discussions! There are some options for us:
- Integrate with DVC. But DVC is not a de-facto standard now and it also has some problems.
- Design the primitives in build.envd to declare expected locations of datasets/model/logs.
```
def build():
    config.model(location="xx", from="https://xx.savedmodel")
```
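A slightly fuller sketch of that idea (all of config.model, config.dataset, and config.logs are hypothetical names; source= is used instead of from=, which is a reserved word in Starlark):

```
# build.envd -- hypothetical primitives for declaring data locations
def build():
    base(os="ubuntu20.04", language="python3")
    # Where to place the model in the environment and where to fetch it from.
    config.model(location="/models/resnet", source="https://xx.savedmodel")
    # Expected locations for datasets and experiment logs.
    config.dataset(location="/data/imagenet")
    config.logs(location="/logs")
```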
Possible design:
runtime.docker.mount and other docker-specific funcs will be ignored in a kubernetes context. runtime.kubernetes.mount() should create/use the corresponding PV/PVC in the backend pod.
runtime.kubernetes.mount("")
Write a proposal this week
Mount Code Proposal
There will be two ways to mount the code: directly clone the git repo, or use syncthing to sync a folder inside the container with the user's local directory.
Reference: Syncthing in Okteto
Possible syntax design:
```
# Can declare globally
info.git_repo = "https://github.com/tensorchord/envd"

def build():
    ...

def build_k8s():
    # Default entrypoint for envd server + k8s; if not found, use the build function instead.
    # Will create a PVC and clone the git repo into it.
    runtime.kubernetes.mount_mode("git_repo")
    # Or:
    # Will use syncthing to sync the folder inside the container with the user's local folder.
    runtime.kubernetes.mount_mode("Syncthing")
    ...
```
envd up under the envd-server context will use build_k8s as the default function; if not found, it will fall back to the build function.
Problem:
- Is envd-server strongly bound to k8s? How many k8s details do we want to expose here? Do we need an extra abstraction at the envd-server level?
Thanks for the proposal
Then do users need to write two build funcs for docker and k8s?
And, how to deal with the data part?
The original build func should work in most cases, but the pod on k8s may not be provisioned exactly as the user wants. Since build funcs won't contain any k8s-specific logic, we can only have one default setting on k8s. Users may need to tune the details.
How about using an if statement in this case:

```
if platform == "k8s":
    runtime.kubernetes.code(src="", dst="")
```
It's viable. I think the core difference here is that the build_k8s design implies the build function is the major way of using envd in a local docker environment. Therefore any change to run on k8s won't touch anything that already exists; you just write a new function.
For data, ideally, the user first registers the dataset details with envd-server, then declares io.mount(src=dataset("imagenet"), dst="/data") in build.envd. However, this is a bit complex; an easier way is just exposing PVC details, such as io.mount(src=k8s.pvc("imagenet"), dst="/data").
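To make the two options concrete, the build file would contain one of the following (io.mount, dataset(), and k8s.pvc() are all hypothetical, as above):

```
def build():
    base(os="ubuntu20.04", language="python3")
    # Option 1: dataset registered on envd-server first, then resolved by name.
    io.mount(src=dataset("imagenet"), dst="/data")
    # Option 2: expose the PVC details directly (simpler, but k8s-specific).
    io.mount(src=k8s.pvc("imagenet"), dst="/data")
```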
@kemingy @aseaday @zwpaper Do you have any question about it?
The initial version can start with info.git_repo = "https://github.com/tensorchord/envd" and clone it into the k8s pod. Syncthing will also be implemented later.
I'm not sure why we need to declare a global info.git_repo object and use a string like "git_repo" to refer to it.
For code sync, should we choose git clone or http?
> How about using an if statement in this case: if platform == "k8s": runtime.kubernetes.code(src="", dst="")
This can save some boilerplate code. I'm wondering if we can hide the context keywords like docker or k8s. We can detect the context and choose the right implementation. If it's not implemented for such a context, just panic.
The difference between contexts can be handled by some configurations. For example:
```
if envd.context.type == "kubernetes":
    envd.mount_host = pvc("user-data")
```
So the user can run the code in different contexts.
If we introduce build_k8s, that means we will need more default build functions for different types of contexts in the future.
I am not very sure the conditional statement is a good choice. We should still insist on declaration over implementation, or at least use an override class or something like polymorphism to solve the problem.
I don't think separating the docker and k8s build is a good idea.
Let me try to explain myself from a user perspective.
Assuming I am a researcher, I would not want to copy and edit my build() into build_k8s() only because I am switching my env from docker to k8s.
As we have context, we know which target env users are building for, and PVC is the de facto standard storage solution in k8s (even for host path, we should use something like local provisioning to implement the PV), so maybe we can set it by default and offer an option to change it.
The philosophy behind envd is that we encapsulate the complexity and make things easier.
So personally, I would vote for a solution to simplify things, like:
- we keep all build scripts only in build()
- we detect and switch between docker and k8s automatically
- we use some path, ~/.env/data/<NAME>/code for example, for docker, and a PVC for k8s by default
- we offer options to change the docker mount point or the storage type, storage parameters, etc.
- the default case should satisfy most users; researchers did not, or even do not, want to care about mount points or PVC/PV

For example:

```
def build():
    runtime.code(src="", dst="")
```

```
def build():
    runtime.code(src="git@github.com:envd.git", dst="/code",
                 k8s_volume="host_path",
                 docker_volume="cache")
```
Makes sense. It is promising.
Any opinions about it? @VoVAllen
We'd better provide a mount config base class to support polymorphic configuration, such as:

```
# mount_config.py
class MyMount:
    def kubernetes(self):
        self.k8s.runtime.use("pvc")
        code_src = self.code.src
        ...

    def host(self):
        ...
```

We could give a default logic for how to mount the code and the data, but it'd better be a mount configuration class for the different cases; too many options in the function would make it difficult to maintain.

```
runtime.code(src="git@github.com:envd.git", dst="/code", configuration=MyMount)
```
Thanks @aseaday and @zwpaper. My core consideration here is also to make the build function unified for both k8s and docker. The key design goal is to support ad hoc configuration on k8s. I think @aseaday's suggestion of using separate configurations to declare the implementation is a good design. Starlark doesn't support classes, but we can use a dict here directly for simplicity.
How about:

```
runtime.code(src="git@github.com:envd.git", dst="/code", configuration={"k8s": {"pvc": "code"}})
```
After team discussion:
- We think it's better to use the same build function and provide a way for the user to identify the current context.
- Use a function such as config.info(repo="https://github.com/tensorchord/envd") to declare the git repo information. The user can declare runtime.mount(src=k8s.pvc('code'), dst=project_root()) for advanced settings (see the sketch below).
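Putting the two decisions together, a build file might look roughly like this (a sketch of the agreed direction only; the exact function names may change):

```
def build():
    base(os="ubuntu20.04", language="python3")
    # Declare the git repo once; each runner decides how to materialize it
    # (e.g. clone into a PVC on kubernetes, bind-mount locally on docker).
    config.info(repo="https://github.com/tensorchord/envd")
    # Advanced setting: pin the code to a specific PVC when running on k8s.
    runtime.mount(src=k8s.pvc("code"), dst=project_root())
```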
Related issue: https://github.com/tensorchord/envd/issues/1054
Generally, I think PVC is a good abstraction that fits all data sources (nfs, oss, lustre, and so on) in enterprise-level usage. Therefore, what we need to do here is just decide which PVC to mount into the pod on k8s.
- PVC as dataset
  - The user will specify runtime.mount(src=data.dataset('daily_sales'), dst="/data"). This will be stored in an image label and won't affect the image build process. When launching the environment, the backend can decide how to mount the dataset.
  - When launching the environment on envd-server, it will send a request to /k8s/data with {"user": "allen", "name": "daily_sales", "project_name": "sales_prediction"}, get a response with the PVC information, and insert it into the pod YAML (sketched below).
  - We can also provide mechanisms to sync datasets between different backends if needed.
- HTTP dataset
  - The user can also specify data.dataset("http://xxx/mnist.tar.gz"), and we can download it when launching the pod.
- SQL access (advanced usage)
  - If the user wants to query information from a database, we can also inject the related credential environment variables if needed.
What's the difference between data.envd and data.dataset? Does data.dataset work without envd-server?
@kemingy It's the same; I just randomly picked a name. data.dataset will only add a label to the image, and the runner will decide how to handle it. Therefore the local_docker runner can also implement this, but there's no guarantee that the datasets at different backends are identical.
Previous discussion: https://github.com/tensorchord/envd/pull/650