envd icon indicating copy to clipboard operation
envd copied to clipboard

feat(lang): Support serving

Open gaocegege opened this issue 3 years ago โ€ข 2 comments

Description

envd is designed to be the development environment for training. But I am thinking if we should support serving.

There are a few things we need to confirm:

  • The pain points of environment management in serving
  • The current experience
  • How to integrate with serving servers
  • How to integrate with model monitoring

gaocegege avatar May 18 '22 00:05 gaocegege

cc @kemingy @VoVAllen

gaocegege avatar May 18 '22 01:05 gaocegege

The serving environment has the same problem: it's hard for the data scientists to write a Dockerfile with the best practice and integrate it with the CI/CD pipeline.

Usually, it's offered by the AI infra team, but they may not be able to review all the changes.

Requirements:

  • non-root
  • CUDA runtime & version (may need to run ldconfig)
  • required Python packages (installed from pip or conda)
  • environments (optional, like OMP_NUM_THREADS, MKL_NUM_THREADS, CUDA_VISIBLE_DEVICES, etc.)
  • where to get the model data
  • entrypoint (tini or some other tools)

kemingy avatar Aug 05 '22 11:08 kemingy

I checked the current interface and implementation. There are several things we need to consider.

serving requirements

  • minimal base image: cuda-runtime like nvidia/cuda:11.6.2-runtime-ubuntu20.04
  • conda environment (optional)
  • minimal Python: created from conda

base image: exclude from dev env

  • base apt packages like vim, git, mercurial, wget, curl, sudo, zsh
  • starship
  • horust
  • envd-sshd

python image: need to refactor the logic

  • jupyter
  • vscode
  • starship prompt
  • git config
  • entrypoint

diff with custom

will be a breaking change but may not really affect the current users

  • non-root user and home dir
  • python
  • PATH

implementation

  • Need to refactor the Graph.Compile() related functions. Compiling dev/serving/custom env should be distinct functions, so no more if g.Image != nil everywhere
  • We maintain only one base image remote cache for each default configuration.
    • dev CPU
    • dev GPU
    • serving CPU
    • serving GPU

interface

  • As we discussed in https://github.com/tensorchord/envd/issues/1179, os and image will be unified to image with the standard format like tensorchord/python:latest. We also need a new parameter in base to indicate that this environment is for development or online serving.
  • We may need a new parameter to indicate the env requires conda (default=True), related to https://github.com/tensorchord/envd/issues/672
  • Serving env looks similar to custom env, but not 100%. For example, we will provide the Python env but custom env should maintain its own

others

  • we should provide more functions in envdlib like pytorch(), tensorflow_gpu(), tensorflow_cpu()
  • these changes may also affect Julia and R lang environments

kemingy avatar Nov 17 '22 09:11 kemingy

Another important point about serving is log. I think we should consider it .

aseaday avatar Nov 17 '22 11:11 aseaday

Should we introduce a new CLI to support the serve?

Maybe envd deploy or something else.

gaocegege avatar Nov 17 '22 12:11 gaocegege

To make it clean, here is a diagram of the current dev env build graph:

image

kemingy avatar Nov 18 '22 03:11 kemingy

Another important point about serving is log. I think we should consider it .

Which kind of logger/collectors are you using? I used to try the GCP StackDriver, users only need to log stdout/stderr in JSON format.

Should we introduce a new CLI to support the serve?

Maybe envd deploy or something else.

SGTM.

kemingy avatar Nov 18 '22 03:11 kemingy

Maybe we could design a interface such as:

runtime.log(driver="logstash", ...)

aseaday avatar Nov 18 '22 05:11 aseaday

I think in serving scenario, logs are handled by k8s. Such as a deamonset, or a sidecar container to send the logs to remote API endpoint. Is there anything special that envd needs take care of? I think the only thing we need to support here is the log file path?

VoVAllen avatar Nov 18 '22 05:11 VoVAllen

I think in serving scenario, logs are handled by k8s. Such as a deamonset, or a sidecar container to send the logs to remote API endpoint. Is there anything special that envd needs take care of? I think the only thing we need to support here is the log file path?

You are right.

aseaday avatar Nov 18 '22 06:11 aseaday

I'm trying to make it compatible with the current interface.

A draft for the envd file:

envdlib = include("https://github.com/tensorchord/envdlib")

requirements = [
    "torch --extra-index-url https://download.pytorch.org/whl/cpu",
    "torchvision",
    "numpy",
    "mosec",
    "msgpack",
    "Pillow",
]

dev = requirements + ["httpx"]


def build():
    base(language="python3")  # will use the default dev image
    install.python_packages(name=dev)
    envdlib.tensorboard(host_port=8888)
    runtime.expose(envd_port=8000, host_port=8000, service="service")


def serving():
    base(language="python3", dev=False, conda=False)  # will use the default CPU serving image
    install.python_packages(name=requirements)


def custom():
    base(image="rust:bullseye", dev=True, conda=True)
    install.python_packages(name=dev)
    install.system_packages(name=["ripgrep"])

This uses the resnet50 example from mosec:

  • service: https://github.com/mosecorg/mosec/blob/main/examples/resnet50_server_msgpack.py
  • client: https://github.com/mosecorg/mosec/blob/main/examples/resnet50_client_msgpack.py

kemingy avatar Nov 21 '22 10:11 kemingy

Should we make dev and conda in base? Or should we split base to several funcs?

gaocegege avatar Nov 21 '22 14:11 gaocegege

Should we make dev and conda in base? Or should we split base to several funcs?

I think the dev is highly related to the base image. conda could be another function.

kemingy avatar Nov 22 '22 03:11 kemingy

dev=True, conda=True looks a bit weird. Can we merge them?

Possible proposal:

  • builtin_packages=dev() and dev is an internal function which returns a list of packages such as ['sshd', 'conda']

VoVAllen avatar Nov 22 '22 10:11 VoVAllen

  • builtin_packages=dev() and dev is an internal function which returns a list of packages such as ['sshd', 'conda']

What should users do with the returned packages? Or should we provide the all-in-one dev_environment()?

kemingy avatar Nov 22 '22 10:11 kemingy

I think we only embed the sshd and starship for users. Are there any other packages?

gaocegege avatar Nov 22 '22 10:11 gaocegege

I think we only embed the sshd and starship for users. Are there any other packages?

Check my comment above: https://github.com/tensorchord/envd/issues/157#issuecomment-1318377491

kemingy avatar Nov 22 '22 10:11 kemingy

  • builtin_packages=dev() and dev is an internal function which returns a list of packages such as ['sshd', 'conda']

What should users do with the returned packages? Or should we provide the all-in-one dev_environment()?

It's just the default options. User can do anything they want. Common scenario I think will be custom base image + nothing extra from envd

VoVAllen avatar Nov 22 '22 11:11 VoVAllen

  • builtin_packages=dev() and dev is an internal function which returns a list of packages such as ['sshd', 'conda']

What should users do with the returned packages? Or should we provide the all-in-one dev_environment()?

It's just the default options. User can do anything they want. Common scenario I think will be custom base image + nothing extra from envd

The problem is that this may return a lot of packages. That will introduce lots of boilerplate code. I think it should be easy to set up the common basic environment for dev and serving. Hopefully in one or two lines. Then they can add whatever they need.

kemingy avatar Nov 22 '22 11:11 kemingy

What do you mean by boilerplate code? And what would the common basic environment look like?

VoVAllen avatar Nov 22 '22 11:11 VoVAllen

What do you mean by boilerplate code? And what would the common basic environment look like?

Do you mean users should pop items from builtin_packages=dev() to disable them? That's hard to implement for the current design. Because we don't use any returned values except for include. Otherwise, users need to specify which one they want to use, that's what I mean by boilerplate code.

kemingy avatar Nov 22 '22 11:11 kemingy

User can directly do buildin_packages=[] if nothing needed, or buildin_packages=['conda'] if he needs conda. However the possible item here will be limited

VoVAllen avatar Nov 22 '22 11:11 VoVAllen

User can directly do buildin_packages=[] if nothing needed, or buildin_packages=['conda'] if he needs conda. However the possible item here will be limited

But this is weird for starlark. Others work like a function, but this is a global variable if I understand correctly.

kemingy avatar Nov 22 '22 11:11 kemingy

Actually I mean base(buildin_packages=[]) instead of base(dev=True, conda=True) ๐Ÿ˜ฒ

VoVAllen avatar Nov 22 '22 11:11 VoVAllen

Actually I mean base(buildin_packages=[]) instead of base(dev=True, conda=True) astonished

Oh, I see. But it's not just some packages. For example, you cannot use runtime.daemon for serving because usually, you won't install horust for serving. This means, in this flexible way, there will be tons of if-else in the implementation everywhere. Because we're not able to know how many features users enabled for this build unless we loop against all the possibilities.

kemingy avatar Nov 23 '22 03:11 kemingy

https://github.com/tensorchord/envd/issues/541#issuecomment-1301581009

gaocegege avatar Nov 23 '22 06:11 gaocegege

Idealy, we don't need to distinguish among dev, serving, custom envs.

We should provide the atomic functions like:

  • dev_tools()
  • install.conda()
  • install.python()
  • etc.

Users can choose what they like. We only maintain the default dev env remote cache. (It's only useful for the 1st build)

I'll be working the implementation.

kemingy avatar Nov 23 '22 08:11 kemingy

TODO:

  • research conda
  • research UID

BTW, how to handle the envdlib in the new design? Will these be in a new subgraph, or use the default graph directly?

gaocegege avatar Nov 25 '22 02:11 gaocegege

BTW, how to handle the envdlib in the new design? Will these be in a new subgraph, or use the default graph directly?

Still the default graph, I think. It should be compatible.

kemingy avatar Nov 25 '22 02:11 kemingy

BTW, how to handle the envdlib in the new design? Will these be in a new subgraph, or use the default graph directly?

Still the default graph, I think. It should be compatible.

I have kind of concern about complexity in default graph. But we could refactor it after all ir features seems stable.

aseaday avatar Nov 25 '22 02:11 aseaday