envd proposal(ir): state based implementation

Signed-off-by: Keming [email protected]

related to #91

Oct 07 '22 10:10 kemingy

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kemingy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [kemingy]

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

Oct 07 '22 10:10 muniu-bot[bot]

How do we merge different stages, like llb.Merge?

Also I think bramble's syntax can be an option

Oct 07 '22 10:10 VoVAllen

> How do we merge different stages, like llb.Merge?

Will add Merge, Diff, File later.

> Also I think bramble's syntax can be an option

Will take a look.

Oct 07 '22 10:10 kemingy

Some examples from bramble https://github.com/maxmcd/bramble/blob/eea4aee51e6ad881166412d61190012fb0d97c56/internal/project/testdata/project/default.bramble

Oct 07 '22 11:10 VoVAllen

I think we should have an idea of what the LLB graph will look like once the user sets more dependencies (such as gcc and PyPI packages), and of how it can still use caches as much as possible.

Simple things should be simple, complex things should be possible.

Oct 09 '22 04:10 VoVAllen

I think bramble's example there looks complex.

https://github.com/maxmcd/bramble/blob/eea4aee51e6ad881166412d61190012fb0d97c56/internal/project/testdata/project/default.bramble

It declares the input arguments explicitly. Personally, I prefer the func chain.

Oct 10 '22 08:10 gaocegege

> I think we should have an idea of what the LLB graph will look like once the user sets more dependencies (such as gcc and PyPI packages), and of how it can still use caches as much as possible.
>
> Simple things should be simple, complex things should be possible.

Agree. What we have now should be simple. Others like parallelism should be possible.

Method chaining should be enough for sequential commands. Each function should return a state. (ExecState should be hidden by introducing more parameters.)
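A minimal Python sketch of the method-chaining idea, assuming a hypothetical immutable `State` class. The class and its `apt_install`, `pip_install`, and `run` methods are illustrative names only, not the envd or Starlark API:

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class State:
    """Hypothetical immutable build state: an ordered tuple of steps."""
    steps: Tuple[str, ...] = ()

    def _then(self, step: str) -> "State":
        # Return a new State instead of mutating, so earlier states stay valid.
        return State(self.steps + (step,))

    def apt_install(self, *packages: str) -> "State":
        return self._then("apt install " + " ".join(packages))

    def pip_install(self, *packages: str) -> "State":
        return self._then("pip install " + " ".join(packages))

    def run(self, command: str) -> "State":
        return self._then("run " + command)


# Every call returns a state, so sequential commands read as one chain.
base = State().apt_install("g++")
final = base.pip_install("numpy").run("python -c 'import numpy'")
print(final.steps)
```

Because each method returns a fresh state, `base` is unchanged after the chain and could serve as the starting point of another branch.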

Oct 10 '22 08:10 kemingy

Some questions:

* Is it possible to auto-merge the two chains? Merge/diff should be advanced statements.
* How to integrate it with the envdlib?

Oct 10 '22 12:10 gaocegege

> * Is it possible to auto-merge the two chains? Merge/diff should be advanced statements.

I think it's hard. (correct me if I'm wrong)

Starlark doesn't support operator overloading like conda_state + vscode_state. But I think we can introduce a new method like conda_state.merge([vscode_state]) if it's helpful.

Besides, we need to explicitly use root.state() to get a copy if another branch is not built from scratch. Otherwise, we don't know when to diverge.
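A rough Python sketch of explicit branching and merging, assuming a hypothetical mutable `State` whose `state()` returns a copy and whose `merge()` records other branches. None of these names are confirmed envd API; they only mirror the calls mentioned above:

```python
import copy
from typing import List, Optional


class State:
    """Hypothetical mutable build state; names are illustrative only."""

    def __init__(self, steps: Optional[List[str]] = None):
        self.steps: List[str] = list(steps or [])
        self.merged: List["State"] = []

    def run(self, command: str) -> "State":
        # Mutates in place, which is why branching needs an explicit copy.
        self.steps.append(command)
        return self

    def state(self) -> "State":
        # Explicit divergence point: later changes do not affect the original.
        return copy.deepcopy(self)

    def merge(self, others: List["State"]) -> "State":
        # Record the branches to combine; a real backend would emit a merge op.
        self.merged.extend(others)
        return self


root = State().run("apt install curl")
conda_state = root.state().run("install conda")
vscode_state = root.state().run("install vscode extensions")
final = conda_state.merge([vscode_state])
print(root.steps, final.steps, [s.steps for s in final.merged])
```

Without the `state()` copies, both branches would append to the same list and the point of divergence would be lost.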

> * How to integrate it with the envdlib?

Some ideas:

* envdlib can provide functions built from scratch or from a user-provided state.
  - from scratch: root.merge([envdlib.compile_rust_serving()])
  - from a state: root = envdlib.tensorboard(conda_state) or root.apply(envdlib.tensorboard, host_port=9000) so users can continue chaining
* We should provide Source.envd_python(), which is equivalent to base(os="ubuntu20.04", language="python3").
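One way `apply` could work, sketched in Python under the assumption of a hypothetical immutable `State`. The `tensorboard` function here is a stand-in for a library helper such as envdlib.tensorboard, and its default port and recorded steps are invented for illustration:

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class State:
    """Hypothetical immutable build state (not the real envd API)."""
    steps: Tuple[str, ...] = ()

    def run(self, command: str) -> "State":
        return State(self.steps + (command,))

    def apply(self, fn: Callable[..., "State"], **kwargs) -> "State":
        # Hand this state to a library function and keep chaining on its result.
        return fn(self, **kwargs)


def tensorboard(state: State, host_port: int = 6006) -> State:
    """Stand-in for a library helper that extends a user-provided state."""
    return state.run("pip install tensorboard").run(f"expose port {host_port}")


conda_state = State().run("install conda")
root = conda_state.apply(tensorboard, host_port=9000).run("pip install numpy")
print(root.steps)
```

Calling `tensorboard(conda_state)` directly gives the same kind of result; `apply` only keeps the call inside the chain.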

Oct 11 '22 08:10 kemingy

Then could the new language syntax be compatible with the existing design?

Or is it a total breaking change?

BTW, could you please provide the example for python-basic with the new design?

Oct 12 '22 05:10 gaocegege

> Then could the new language syntax be compatible with the existing design?
>
> Or is it a total breaking change?

I think it will be a breaking change.

> BTW, could you please provide the example for python-basic with the new design?

It has already been added to the proposal. PTAL.

Oct 12 '22 05:10 kemingy

@VoVAllen WDYT

I have no opinion on it. Let's start by researching whether Starlark supports it.

Oct 12 '22 07:10 gaocegege

I'm a bit concerned about the current proposal. The current design is detail-oriented, which makes it more complicated than the original design. It also looks similar to LLB, so we could consider exposing LLB-like primitives directly. Explicit dependency declaration is an advanced feature; LLB primitives would be easier for us to maintain and would ensure that "complex things are possible".

Some personal thoughts: explicitly define two or three kinds of stages: base stages, envd-managed stages (install.python_packages etc.), and user-managed stages (run(XXX)).

The difference between them is:

* base stages can be overwritten by custom images, and are managed by envd if not specified
* envd-managed stages will be parallelized and use the cache as much as possible to accelerate the process, so no dependency can be set here
* user-managed stages can be fully customized, with explicit dependencies
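To illustrate how such stages could be scheduled, here is a small Python sketch that groups steps into batches that can run in parallel. The dependency graph and step names are made up for the example; only `graphlib.TopologicalSorter` (Python 3.9+) is a real standard-library API:

```python
from graphlib import TopologicalSorter

# Hypothetical graph: each key depends on the steps in its set.
# envd-managed steps depend only on the base stage; user-managed steps
# may declare explicit dependencies on each other.
graph = {
    "base": set(),
    "envd:apt_packages": {"base"},
    "envd:python_packages": {"base"},
    "user:apt g++": {"base"},
    "user:pip package_needs_g++": {"user:apt g++"},
}


def parallel_batches(deps):
    """Return lists of steps; steps within one list have no mutual dependency."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)
    return batches


for batch in parallel_batches(graph):
    print(batch)
```

In this sketch the envd-managed steps land in the same batch because nothing orders them, while the user-managed pip step waits for its declared g++ dependency.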

Other ideas:

All functions provided by envd could take a new argument, for example one called state.

In user stages, the user can do:

state = stage('user')
state1 = install.apt_packages(["g++"], state=state)
new_state = install.python_packages(["package_needs_g++"], state=state1)

and to define it as a custom function:

def install_inhouse_package():
  state = stage('user')
  state1 = install.apt_packages(["g++"], state=state)
  new_state = install.python_packages(["package_needs_g++"], state=state1)
  # envd_output is a builtin variable; add means merge the state into the final output
  envd_output.add(new_state)
  return new_state

To use it:

def build():
  install.python_packages(['torch'])
  install_inhouse_package()

WDYT

Oct 12 '22 08:10 VoVAllen

> I'm a bit concerned about the current proposal. The current design is detail-oriented, which makes it more complicated than the original design. It also looks similar to LLB, so we could consider exposing LLB-like primitives directly. Explicit dependency declaration is an advanced feature; LLB primitives would be easier for us to maintain and would ensure that "complex things are possible".
>
> Some personal thoughts: explicitly define two or three kinds of stages: base stages, envd-managed stages (install.python_packages etc.), and user-managed stages (run(XXX)).
>
> The difference between them is:
> * base stages can be overwritten by custom images, and are managed by envd if not specified
> * envd-managed stages will be parallelized and use the cache as much as possible to accelerate the process, so no dependency can be set here
> * user-managed stages can be fully customized, with explicit dependencies

This is similar to both the current implementation and this proposal. We do have different stages; they are just not explicit.

We can provide the install.conda_python() function. So users who start with the custom images can use it to install the python environment.

> Other ideas:
>
> All functions provided by envd could take a new argument, for example one called state.
>
> In user stages, the user can do:
>
> state = stage('user')
> state1 = install.apt_packages(["g++"], state=state)
> new_state = install.python_packages(["package_needs_g++"], state=state1)
>
> and to define it as a custom function:
>
> def install_inhouse_package():
>   state = stage('user')
>   state1 = install.apt_packages(["g++"], state=state)
>   new_state = install.python_packages(["package_needs_g++"], state=state1)
>   # envd_output is a builtin variable; add means merge the state into the final output
>   envd_output.add(new_state)
>   return new_state
>
> To use it:
>
> def build():
>   install.python_packages(['torch'])
>   install_inhouse_package()
>
> WDYT

Defining the dependencies with an extra state argument is acceptable but not very user-friendly.

The LLB-like syntax is only complex when you need to use diff and merge. Otherwise, the method chaining should be a simple solution.

Oct 12 '22 08:10 kemingy

One more thing, this is incompatible with config.envd.

Oct 13 '22 00:10 kemingy
