feat(docker): Support buildkitd moby worker in dockerd 22.06-beta

Open gaocegege opened this issue 3 years ago • 36 comments

Description

[+] Building 103.2s (11/11) FINISHED                                                                                                                                                                       
 => docker-image://docker.io/nvidia/cuda:11.2.0-cudnn8-devel-ubuntu20.04                                                                                                                              2.8s
 => => resolve docker.io/nvidia/cuda:11.2.0-cudnn8-devel-ubuntu20.04                                                                                                                                  2.6s
 => local://context                                                                                                                                                                                   3.1s
 => => transferring context: 50.52kB                                                                                                                                                                  0.1s
 => CACHED sh -c apt-get update && apt-get install -y --no-install-recommends python3 python3-pip                                                                                                     0.0s
 => CACHED apt install gcc                                                                                                                                                                            0.0s
 => CACHED pip install -i https://mirror.sjtu.edu.cn/pypi/web/simple jupyter ormb                                                                                                                     0.0s
 => CACHED mkdir /var/midi/remote                                                                                                                                                                     0.0s
 => CACHED mkdir /var/midi/bin                                                                                                                                                                        0.0s
 => CACHED copy /examples/ssh_keypairs/public.pub /var/midi/remote/authorized_keys                                                                                                                    0.0s
 => CACHED copy /bin/midi-ssh /var/midi/bin/midi-ssh                                                                                                                                                  0.0s
 => CACHED merge (apt install gcc, pip install -i https://mirror.sjtu.edu.cn/pypi/web/simple jupyter ormb, copy /bin/midi-ssh /var/midi/bin/midi-ssh)                                                 0.0s
 => exporting to oci image format                                                                                                                                                                    96.6s
 => => exporting layers                                                                                                                                                                               0.0s
 => => exporting manifest sha256:32bbf82b70e17ca70b11b88310ef02450db2bed3b94765415650a2227baa63cf                                                                                                     3.1s
 => => exporting config sha256:9bcaf4d291970033f2a6316dbf11912e77de402c81f0b11896f16c8bab19360b                                                                                                       1.2s
 => => sending tarball                                                                                                                                                                               91.6s

The image is built inside buildkit and does not exist on the docker host, so we need to pipe the image built by buildkit into the docker host. It takes about 100s to docker load a 20G base image, which is too slow. We need to optimize it.
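For reference, a rough sketch of this pipeline in Go (assuming the BuildKit client and the Docker Go SDK; def and tag are placeholders for the real LLB definition and image name): buildkitd exports a docker-format tarball and we stream it into dockerd via ImageLoad, which is where the ~100s is spent.

package builder

import (
	"context"
	"io"

	dockerclient "github.com/docker/docker/client"
	bkclient "github.com/moby/buildkit/client"
	"github.com/moby/buildkit/client/llb"
)

// loadViaTarball is the current flow: buildkitd exports a docker-format
// tarball and we pipe it into dockerd's ImageLoad ("sending tarball").
func loadViaTarball(ctx context.Context, bk *bkclient.Client, docker *dockerclient.Client, def *llb.Definition, tag string) error {
	pr, pw := io.Pipe()

	done := make(chan error, 1)
	go func() {
		// dockerd reads the whole tarball from the pipe; this is the slow
		// part for multi-GB images.
		resp, err := docker.ImageLoad(ctx, pr, true)
		if err == nil {
			io.Copy(io.Discard, resp.Body)
			resp.Body.Close()
		}
		done <- err
	}()

	_, err := bk.Solve(ctx, def, bkclient.SolveOpt{
		Exports: []bkclient.ExportEntry{{
			Type:  bkclient.ExporterDocker, // docker tarball exporter
			Attrs: map[string]string{"name": tag},
			Output: func(map[string]string) (io.WriteCloser, error) {
				return pw, nil
			},
		}},
	}, nil)
	pw.Close()
	if loadErr := <-done; err == nil {
		err = loadErr
	}
	return err
}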

gaocegege avatar Apr 24 '22 06:04 gaocegege

We merged multiple base layers into the image, so the size is large. diff should be used to reduce it. Ref https://github.com/tensorchord/MIDI/pull/54/commits/e64786b91c821b250b06d228e8cbc5be0bcad59d
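A minimal sketch of the idea with BuildKit's LLB Go API (the base image and the two steps are only illustrative): build each step on top of the base, then merge the base with just the diffs of each step, so the base layers are not duplicated in the merged result.

package builder

import "github.com/moby/buildkit/client/llb"

// finalState composes the image so that the shared base layers appear only
// once: each step contributes llb.Diff(base, step), i.e. only the files it
// added on top of the base, and the diffs are merged back onto the base.
func finalState() llb.State {
	base := llb.Image("docker.io/nvidia/cuda:11.2.0-cudnn8-devel-ubuntu20.04")

	aptStep := base.Run(llb.Shlex("apt-get install -y gcc")).Root()
	pipStep := base.Run(llb.Shlex("pip install jupyter")).Root()

	return llb.Merge([]llb.State{
		base,
		llb.Diff(base, aptStep),
		llb.Diff(base, pipStep),
	})
}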

gaocegege avatar Apr 25 '22 03:04 gaocegege

The base image nvidia/cuda:11.2-devel-ubuntu2004 is 4GB, but our image is 7GB. We should figure out where the extra 3GB comes from.

gaocegege avatar Apr 25 '22 07:04 gaocegege

Perhaps we could make MIDI work in a way similar to docker-buildx, which uses the BuildKit library bundled into the Docker daemon with the docker driver, so that the image is actually built by dockerd and we don't need to load the image manually.

knight42 avatar Apr 25 '22 09:04 knight42

Cool! I think it is a great idea. Let's investigate how the docker buildx plugin does it.

gaocegege avatar Apr 25 '22 11:04 gaocegege

buildx does not put the image into the local docker daemon either. We need to specify --load to load the artifact into docker.

But there is an optimization we could use:

$ docker volume inspect buildx_buildkit_amazing_albattani0_state
[
    {
        "CreatedAt": "2022-05-05T17:30:23+08:00",
        "Driver": "local",
        "Labels": null,
        "Mountpoint": "/var/lib/docker/volumes/buildx_buildkit_amazing_albattani0_state/_data",
        "Name": "buildx_buildkit_amazing_albattani0_state",
        "Options": null,
        "Scope": "local"
    }
]

$ docker inspect container
        {
          "Type": "volume",
          "Source": "buildx_buildkit_amazing_albattani0_state",
          "Target": "/var/lib/buildkit"
        }

We could create a volume to keep the cache persistent.
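A sketch of that optimization with the Docker Go SDK (types as of docker 20.10; the volume and container names are only examples): create a named volume and mount it at /var/lib/buildkit in the envd_buildkitd container, so the buildkit cache survives container restarts just like buildx's state volume above.

package main

import (
	"context"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/api/types/mount"
	"github.com/docker/docker/api/types/volume"
	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}

	// Named volume that keeps /var/lib/buildkit (buildkitd state and cache)
	// across container restarts.
	vol, err := cli.VolumeCreate(ctx, volume.VolumeCreateBody{Name: "envd_buildkitd_state"})
	if err != nil {
		panic(err)
	}

	resp, err := cli.ContainerCreate(ctx,
		&container.Config{Image: "docker.io/moby/buildkit:v0.10.1"},
		&container.HostConfig{
			Privileged: true, // buildkitd needs privileges to mount/snapshot
			Mounts: []mount.Mount{{
				Type:   mount.TypeVolume,
				Source: vol.Name,
				Target: "/var/lib/buildkit",
			}},
		},
		nil, nil, "envd_buildkitd")
	if err != nil {
		panic(err)
	}
	if err := cli.ContainerStart(ctx, resp.ID, types.ContainerStartOptions{}); err != nil {
		panic(err)
	}
}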

gaocegege avatar May 07 '22 11:05 gaocegege

Ref https://github.com/docker/buildx/blob/3adca1c17db9439fb94499ccef62f6a40dcaf899/build/build.go#L1354:#L1375

Buildx also relies on docker load.

	w = &waitingWriter{
		PipeWriter: pw,
		f: func() {
			resp, err := c.ImageLoad(ctx, pr, false)
			defer close(done)
			if err != nil {
				pr.CloseWithError(err)
				w.mu.Lock()
				w.err = err
				w.mu.Unlock()
				return
			}
			prog := progress.WithPrefix(status, "", false)
			progress.FromReader(prog, "importing to docker", resp.Body)
		},
		done:   done,
		cancel: cancel,
	}
	return w, func() {
		pr.Close()
	}, nil
}

gaocegege avatar May 08 '22 11:05 gaocegege

Now we use diff and merge to reduce the size, but docker load is still slow.

7GB image load takes ~17s

gaocegege avatar May 08 '22 11:05 gaocegege

7GB image load takes ~17s

Wow! It is really awesome! 💯

knight42 avatar May 08 '22 16:05 knight42

Yeah, it comes from containerd diff, I think.

Maybe we can have a look at how docker build loads the image into its local image store.

gaocegege avatar May 09 '22 00:05 gaocegege

I found that sending tarball still takes ~30s on my machine. Is this expected?

VoVAllen avatar May 10 '22 07:05 VoVAllen

Yeah, it is expected in the current design. sending tarball is the docker load process.

gaocegege avatar May 10 '22 08:05 gaocegege

I dived a little deeper into buildx and I am sure that docker load is not necessarily required by buildx. Excerpt from the official documentation: https://docs.docker.com/engine/reference/commandline/buildx_create/#driver

docker driver: Uses the builder that is built into the docker daemon. With this driver, the --load flag is implied by default on buildx build. However, building multi-platform images or exporting cache is not currently supported.

A PoC is available as well: https://gist.github.com/knight42/6c128a2edf7cebcb6816343da833295a. The built image is present in docker images without docker load.

After learning about that, I have been trying to get rid of docker load in envd, but it is unfortunate that the version of the bundled buildkitd in docker engine 20.10.14 is v0.8.3-4-gbc07b2b8, while mergeop was introduced in v0.10.3.

That said, even if the bundled buildkitd in docker becomes new enough to support mergeop in the future, I think we still need some fallback mechanism, like using the docker-container driver as we do now.

knight42 avatar May 15 '22 15:05 knight42

@knight42 Thanks for the research!

but it is unfortunate that the version of the bundled buildkitd in docker engine 20.10.14 is v0.8.3-4-gbc07b2b8, while mergeop was introduced in v0.10.3

I am wondering why we should use docker 20.10.14. Is it the version that supports built-in load?

gaocegege avatar May 16 '22 00:05 gaocegege

while mergeop was introduced in v0.10.3.

Currently, we use buildkit v0.10.1, and merge op is supported in this version. I am not sure if it only works after v0.10.3 :thinking:

gaocegege avatar May 16 '22 01:05 gaocegege

Got the problem here.

failed to solve LLB: failed to solve: failed to load LLB: unknown API capability mergeop

The client returns the error that we cannot use merge op if we eliminate docker load.
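If we drive the build through a BuildKit gateway (frontend) client, we could detect this up front instead of failing at solve time: the gateway reports the daemon's LLB capabilities, so a helper like the hypothetical one below could check for mergeop and let us fall back to the envd_buildkitd (docker-container) path when it is missing.

package builder

import (
	gwclient "github.com/moby/buildkit/frontend/gateway/client"
	"github.com/moby/buildkit/solver/pb"
)

// supportsMergeOp reports whether the connected buildkitd advertises the
// mergeop LLB capability; the buildkitd bundled in dockerd 20.10 (v0.8.x)
// does not, which produces the "unknown API capability mergeop" error above.
func supportsMergeOp(c gwclient.Client) bool {
	return c.BuildOpts().LLBCaps.Supports(pb.CapMergeOp) == nil
}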

gaocegege avatar May 16 '22 06:05 gaocegege

Ref https://github.com/docker/buildx/issues/1132

gaocegege avatar May 16 '22 07:05 gaocegege

I am wondering why we should use docker 20.10.14. Is it the version that supports built-in load?

Nope, it is just the version of the dockerd on my laptop.

while mergeop was introduced in v0.10.3.

Currently, we use buildkit v0.10.1, and merge op is supported in this version. I am not sure if it only works after v0.10.3 🤔

Sorry, I double-checked the MergeOp PR; merge op was actually introduced in v0.10.0.

The client returns the error that we cannot use merge op if we eliminate docker load.

Yeah, since we heavily leverage merge op in envd, if we want to get rid of docker load, we need to make sure the bundled buildkitd in dockerd supports merge op.

knight42 avatar May 16 '22 13:05 knight42

20.10.16 still uses v0.8.3-4-gbc07b2b8. I am afraid that we need to wait until the next milestone of docker.

https://github.com/moby/moby/blob/v20.10.16/vendor.conf#L36

gaocegege avatar May 17 '22 06:05 gaocegege

Things that we need to confirm:

  • How buildkit stores layers
  • How buildx loads images into the local image store
  • How buildkit is initialized in docker
  • What is changed in earthly's customized buildkit? https://github.com/moby/buildkit/compare/master...earthly:earthly-main

gaocegege avatar May 20 '22 05:05 gaocegege

Using docker's bundled buildkitd directly is not possible now. We will figure out if we can mount some directory into the envd_buildkitd container to achieve a similar experience.

gaocegege avatar May 20 '22 05:05 gaocegege

Here https://github.com/moby/moby/blob/master/builder/builder-next/controller.go#L44:#L220 docker creates a buildkit daemon (control.Controller).

And the most important part is https://github.com/moby/moby/blob/master/builder/builder-next/worker/worker.go#L83. Docker has a new worker type, moby.

gaocegege avatar May 20 '22 13:05 gaocegege

	bk, err := buildkit.New(buildkit.Opt{
		SessionManager:      sm,
		Root:                filepath.Join(config.Root, "buildkit"),
		Dist:                d.DistributionServices(),
		NetworkController:   d.NetworkController(),
		DefaultCgroupParent: cgroupParent,
		RegistryHosts:       d.RegistryHosts(),
		BuilderConfig:       config.Builder,
		Rootless:            d.Rootless(),
		IdentityMapping:     d.IdentityMapping(),
		DNSConfig:           config.DNSConfig,
		ApparmorProfile:     daemon.DefaultApparmorProfile(),
	})

buildkit (the builder-next control.Controller) uses dockerd's DistributionServices, so the image blobs and metadata are stored in the docker image store directly, and there is no need to load images.

gaocegege avatar May 21 '22 02:05 gaocegege

https://github.com/moby/moby/issues/9935#issuecomment-297586182

The maintainers said it is not possible to run multiple docker daemons on one data root /var/lib/docker.

gaocegege avatar May 21 '22 02:05 gaocegege

The Docker daemon was explicitly designed to have exclusive access to /var/lib/docker. Nothing else should touch, poke, or tickle any of the Docker files hidden there.

Why is that? It’s one of the hard learned lessons from the dotCloud days. The dotCloud container engine worked by having multiple processes accessing /var/lib/dotcloud simultaneously. Clever tricks like atomic file replacement (instead of in-place editing), peppering the code with advisory and mandatory locking, and other experiments with safe-ish systems like SQLite and BDB only got us so far; and when we refactored our container engine (which eventually became Docker) one of the big design decisions was to gather all the container operations under a single daemon and be done with all that concurrent access nonsense.

This means that if you share your /var/lib/docker directory between multiple Docker instances, you’re gonna have a bad time. Of course, it might work, especially during early testing. “Look ma, I can docker run ubuntu!” But try to do something more involved (pull the same image from two different instances…) and watch the world burn.

gaocegege avatar May 21 '22 02:05 gaocegege

But I still think it is possible to run a (minimal) docker daemon in our envd_buildkitd container. Only /var/lib/docker/image is needed in envd_buildkitd.

The main concern above is that multiple daemons on the same data root (/var/lib/docker/) may break consistency. Let's have a look at the directory layout of the image part of /var/lib/docker:

/var/lib/docker/image/overlay2
├── distribution
│   ├── diffid-by-digest
│   └── v2metadata-by-diffid
├── imagedb
│   ├── content
│   └── metadata
├── layerdb
│   ├── mounts
│   ├── sha256
│   └── tmp
└── repositories.json

distribution is used to communicate with the OCI image registry, so it is not needed in envd_buildkitd. imagedb and layerdb are essentially key-value stores where the key is the file name (a hex digest), so they should not be affected by concurrent daemons.
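To illustrate the key-value layout (paths assume the overlay2 graph driver), a small sketch that just enumerates the store without any daemon-level coordination:

package main

import (
	"fmt"
	"os"
)

func main() {
	// Each entry under imagedb/content/sha256 is an image config JSON file
	// named by its digest; layerdb/sha256 is keyed the same way.
	entries, err := os.ReadDir("/var/lib/docker/image/overlay2/imagedb/content/sha256")
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		fmt.Println("sha256:" + e.Name())
	}
}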

The last one, repositories.json, stores the map from image tag names to image IDs:

{
    "ubuntu": {
        "ubuntu:20.04": "sha256:53df61775e8856a464ca52d4cd9eabbf4eb3ceedbde5afecc57e417e7b7155d5",
        "ubuntu@sha256:47f14534bda344d9fe6ffd6effb95eefe579f4be0d508b7445cf77f61a0e5724": "sha256:53df61775e8856a464ca52d4cd9eabbf4eb3ceedbde5afecc57e417e7b7155d5"
    }
}

It may be affected by concurrent daemons, but we may have some workarounds. For example, we can tag or rename the image via the docker API instead of tagging it at the low level, so we avoid manipulating this JSON file directly.
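A sketch of that workaround with the Docker Go SDK (the image ID and tag are placeholders): tag the built image through the API and let dockerd update repositories.json itself, instead of writing the file from the buildkitd side.

package main

import (
	"context"

	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}
	// Tag the freshly built image ID via the docker API; dockerd owns the
	// update to repositories.json, so we never touch the JSON file directly.
	imageID := "sha256:9bcaf4d291970033f2a6316dbf11912e77de402c81f0b11896f16c8bab19360b"
	if err := cli.ImageTag(ctx, imageID, "envd-example:dev"); err != nil {
		panic(err)
	}
}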

Thus in the buildkit exporter code, we should remove logic like this:

	if e.opt.ReferenceStore != nil {
		targetNames := strings.Split(e.targetName, ",")
		for _, targetName := range targetNames {
			tagDone := oneOffProgress(ctx, "naming to "+targetName)
			tref, err := distref.ParseNormalizedNamed(targetName)
			if err != nil {
				return nil, err
			}
			if err := e.opt.ReferenceStore.AddTag(tref, digest.Digest(id), true); err != nil {
				return nil, tagDone(err)
			}
			_ = tagDone(nil)
		}
	}

Or we set ReferenceStore to nil

gaocegege avatar May 21 '22 03:05 gaocegege

We can create the image service outside of docker. Next, I will try whether it is possible to embed it into the buildkitd process.

package main

import (
	"context"
	"path/filepath"

	"github.com/docker/docker/api/types"
	_ "github.com/docker/docker/daemon/graphdriver/overlay2"
	"github.com/docker/docker/daemon/images"
	dmetadata "github.com/docker/docker/distribution/metadata"
	"github.com/docker/docker/image"
	"github.com/docker/docker/layer"
	"github.com/docker/docker/pkg/idtools"
	refstore "github.com/docker/docker/reference"
)

func main() {
	root := "/var/lib/docker"
	graphDriver := "overlay2"
	layerStore, err := layer.NewStoreFromOptions(layer.StoreOptions{
		Root:                      root,
		MetadataStorePathTemplate: filepath.Join(root, "image", "%s", "layerdb"),
		GraphDriver:               graphDriver,
		GraphDriverOptions:        []string{},
		IDMapping:                 idtools.IdentityMapping{},
		ExperimentalEnabled:       false,
	})
	if err != nil {
		panic(err)
	}
	m := layerStore.Map()
	for k, v := range m {
		println(k, v)
	}
	imageRoot := filepath.Join(root, "image", graphDriver)
	ifs, err := image.NewFSStoreBackend(filepath.Join(imageRoot, "imagedb"))
	if err != nil {
		panic(err)
	}

	imageStore, err := image.NewImageStore(ifs, layerStore)
	if err != nil {
		panic(err)
	}
	im := imageStore.Map()
	for k, v := range im {
		println(k, v.Size)
	}

	refStoreLocation := filepath.Join(imageRoot, `repositories.json`)
	rs, err := refstore.NewReferenceStore(refStoreLocation)
	if err != nil {
		panic(err)
	}
	_ = rs
	distributionMetadataStore, err := dmetadata.NewFSMetadataStore(filepath.Join(imageRoot, "distribution"))
	if err != nil {
		panic(err)
	}
	_ = distributionMetadataStore
	imgSvcConfig := images.ImageServiceConfig{
		DistributionMetadataStore: distributionMetadataStore,
		ImageStore:                imageStore,
		LayerStore:                layerStore,
		ReferenceStore:            rs,
	}
	imageService := images.NewImageService(imgSvcConfig)
	is, err := imageService.Images(context.TODO(), types.ImageListOptions{})
	if err != nil {
		panic(err)
	}
	imageService.DistributionServices()
	for _, i := range is {
		println(i.ID)
	}
}

gaocegege avatar May 21 '22 11:05 gaocegege

https://github.com/tensorchord/buildkit/pull/1/files

I am working on it. It is not easy.. :worried:

Things we may need to change:

  • source/containerimage/pull.go to use docker/docker/image.Store
  • worker

gaocegege avatar May 22 '22 01:05 gaocegege

nerdctl uses a newer buildkit https://github.com/containerd/nerdctl/blob/e77e05b5fd252274e3727e0439e9a2d45622ccb9/Dockerfile.d/SHA256SUMS.d/buildkit-v0.10.3. Can we leverage this?

VoVAllen avatar May 22 '22 16:05 VoVAllen

nerdctl uses a newer buildkit https://github.com/containerd/nerdctl/blob/e77e05b5fd252274e3727e0439e9a2d45622ccb9/Dockerfile.d/SHA256SUMS.d/buildkit-v0.10.3. Can we leverage this?

We are already using a newer one than nerdctl.

gaocegege avatar May 23 '22 02:05 gaocegege

It is possible! https://github.com/gaocegege/buildkit/pull/1/files reuses the docker image store when caching the docker image in buildkit.

The buildkit instance in the container owns its own image cache. This PR reuses /var/lib/docker/image/overlay2/ instead of using its own separate cache.

gaocegege avatar May 23 '22 12:05 gaocegege