Proposal: Image Acceleration (Apparate)

Open kofj opened this issue 3 years ago • 34 comments

WIP. Update later.

Signed-off-by: fanjiankong [email protected]

kofj avatar May 24 '21 09:05 kofj

This proposal sounds conceptually very similar to https://github.com/containerd/stargz-snapshotter :thinking:

tianon avatar Jun 02 '21 17:06 tianon

And even more similar to the nydus image acceleration service: https://github.com/dragonflyoss/image-service

We've been discussing with the Harbor team how to create a pluggable image conversion mechanism that works for different image formats (currently nydus and estargz are included). Maybe Apparate can join forces as well ;)

/cc @ktock

bergwolf avatar Jun 03 '21 08:06 bergwolf

What's the difference between https://github.com/dragonflyoss/image-service and this one?

xujihui1985 avatar Jun 03 '21 08:06 xujihui1985

> We've been discussing with the Harbor team how to create a pluggable image conversion mechanism that works for different image formats (currently nydus and estargz are included). Maybe Apparate can join forces as well ;)

:+1:

A variety of image formats have been discussed in the community recently (e.g. nydus, estargz, zstd:chunked...), not only Apparate, so it would be great to have a generic (and pluggable) conversion mechanism that works for all of them.

ktock avatar Jun 03 '21 08:06 ktock

A pluggable image conversion mechanism has also been proposed here: https://github.com/goharbor/community/pull/167 We can participate in the discussion together. :)

imeoer avatar Jun 03 '21 09:06 imeoer

This seems like another image-service (https://github.com/dragonflyoss/image-service) and another stargz (https://github.com/containerd/stargz-snapshotter). I think Apparate, image-service and some other new image formats are based on, or are extensions of, stargz, because they look quite similar. It would be better to make stargz a standard and have other implementations stay compatible with it while developing their own features.

ghost avatar Jun 03 '21 15:06 ghost

@lovecontainers Standardization of lazy pulling in the current version of the OCI Image Spec (v1) is discussed in https://github.com/opencontainers/image-spec/issues/815. nydus has been proposed for the next version of the OCI Image Spec (a.k.a. OCIv2). cf. https://www.cncf.io/blog/2020/10/20/introducing-nydus-dragonfly-container-image-service/

ktock avatar Jun 04 '21 00:06 ktock

@ktock Yeah, I have hopes for the next OCI spec. But nydus looks quite similar to stargz; as its doc explains, nydus is an improvement on stargz. In fact, almost all of the newer remote image formats look the same. So I think it may be better to bring up a stargz v2 rather than so many stargz-like formats. At this moment, wide discussion is necessary, but repeated discussions are meaningless.

ghost avatar Jun 04 '21 08:06 ghost

@lovecontainers Yes, repeated ones are meaningless. And there's a novel solution open-sourced recently: https://github.com/alibaba/overlaybd https://github.com/alibaba/accelerated-container-image https://www.usenix.org/conference/atc20/presentation/li-huiba https://www.usenix.org/conference/atc21/presentation/wang-ao

lihuiba avatar Jun 04 '21 08:06 lihuiba

@lihuiba Thank you. This is the first time I have learned about overlaybd, as I am a beginner with containers. It looks like a traditional VM image and is natively friendly to remote access. The most interesting point for me is that your implementation does not depend on FUSE.

ghost avatar Jun 04 '21 10:06 ghost

> This seems like another image-service (https://github.com/dragonflyoss/image-service) and another stargz (https://github.com/containerd/stargz-snapshotter). I think Apparate, image-service and some other new image formats are based on, or are extensions of, stargz, because they look quite similar. It would be better to make stargz a standard and have other implementations stay compatible with it while developing their own features.

There's a fundamental difference between stargz and nydus :) Nydus can be thought of as a file system over object storage with a split fs metadata/data design, so different images can share data blob objects.
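
To make the split metadata/data idea concrete, here is a minimal sketch. It is not the nydus on-disk format: `BlobStore`, `build_image`, `read_file` and the tiny `CHUNK_SIZE` are hypothetical names chosen for illustration. Each image keeps only its own metadata (path to chunk digests), while chunk data lives in one content-addressed store shared by all images.

```python
import hashlib

CHUNK_SIZE = 4  # deliberately tiny so the example is easy to follow (assumption)


class BlobStore:
    """Hypothetical content-addressed chunk store shared by every image."""

    def __init__(self):
        self.chunks = {}

    def put(self, data):
        # Identical chunks hash to the same digest, so they are stored once.
        digest = hashlib.sha256(data).hexdigest()
        self.chunks.setdefault(digest, data)
        return digest

    def get(self, digest):
        return self.chunks[digest]


def build_image(files, store):
    """Return per-image metadata: path -> ordered list of chunk digests."""
    meta = {}
    for path, content in files.items():
        meta[path] = [
            store.put(content[i:i + CHUNK_SIZE])
            for i in range(0, len(content), CHUNK_SIZE)
        ]
    return meta


def read_file(meta, store, path):
    """Reassemble a file by resolving its chunk digests in the shared store."""
    return b"".join(store.get(d) for d in meta[path])
```

Two images that both contain the same `/bin/sh` would each carry their own metadata entry for it, but the store would hold the underlying chunks only once.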

jiangliu avatar Jun 04 '21 10:06 jiangliu

@malc0lm Please subscribe and discuss here; we need to answer questions from the community.

kofj avatar Jun 04 '21 11:06 kofj

> There's a fundamental difference between stargz and nydus :) Nydus can be thought of as a file system over object storage with a split fs metadata/data design, so different images can share data blob objects.

It is really a great improvement, but it is hard to call it a fundamental difference, and the same goes for Apparate. These similar proposals may compete for business, since they stand for different companies, but that does nothing to help the community reach an agreement on the next OCI spec.

ghost avatar Jun 04 '21 17:06 ghost

@lovecontainers @tianon Overlaybd is a combination of container image and VM image. It is a layered image in the form of a block device. It doesn't depend on FUSE / virtio-fs. I believe this design gathers the best of both worlds (container and VM), and it is applicable to both worlds.
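
As a rough illustration of what "a layered image in the form of a block device" can mean, the sketch below resolves a block read by searching layers from top to bottom. This is an assumption-laden toy, not overlaybd's actual format or index structure: the layer representation (a dict of block index to bytes), `BLOCK_SIZE` and `read_block` are all hypothetical.

```python
BLOCK_SIZE = 4  # toy block size for readability (assumption)


def read_block(layers, index):
    """Return block `index` from the merged view of `layers`.

    `layers` is ordered lowest first; each layer maps a block index to the
    bytes it wrote. The topmost layer that contains the block wins.
    """
    for layer in reversed(layers):
        if index in layer:
            return layer[index]
    # A block never written by any layer reads back as zeros.
    return b"\x00" * BLOCK_SIZE


# Usage: the upper layer overrides block 0 and adds block 2.
base = {0: b"aaaa", 1: b"bbbb"}
upper = {0: b"AAAA", 2: b"cccc"}
merged = [read_block([base, upper], i) for i in range(4)]
```

Any regular file system can then be placed on top of such a merged block view, which is why the block-level approach does not need FUSE.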

lihuiba avatar Jun 05 '21 03:06 lihuiba

> @lovecontainers @tianon Overlaybd is a combination of container image and VM image. It is a layered image in the form of a block device. It doesn't depend on FUSE / virtio-fs. I believe this design gathers the best of both worlds (container and VM), and it is applicable to both worlds.

interesting, good for you, you are so funny

xujihui1985 avatar Jun 05 '21 13:06 xujihui1985

> @lovecontainers @tianon Overlaybd is a combination of container image and VM image. It is a layered image in the form of a block device. It doesn't depend on FUSE / virtio-fs. I believe this design gathers the best of both worlds (container and VM), and it is applicable to both worlds.

Obviously a filesystem has a higher abstraction level than a block device, which means more business value can be added on top of it, and I don't understand what makes you think this overlaybd thing is the best of both worlds just because it does not depend on FUSE and virtiofs. You depend on TCM, which is another ko, so what is the advantage? You are welcome to identify the pros and cons of the different approaches; instead you keep saying you are the best and the others are meaningless, which makes me feel disgusted.

xujihui1985 avatar Jun 05 '21 13:06 xujihui1985

@xujihui1985 Hi, jihui. "It doesn't depend on FUSE / virtio-fs" is just a statement of fact, and a confirmation to lovecontainers. The reasons why I believe overlaybd is the best are complicated, and I suggest you read the papers mentioned above. There are paragraphs discussing this topic. Thanks!

lihuiba avatar Jun 07 '21 02:06 lihuiba

@xujihui1985 A higher abstraction level doesn't necessarily mean a better solution. For example, Python is a higher-level language than Java or C/C++, but Python is not necessarily better in every aspect. The best(-fit) abstraction varies across scenarios. The block device abstraction doesn't preclude a file system abstraction on top of it. Actually, we have built an internal solution that includes an enhanced file system, called rofs, atop overlaybd. This solution allows everything one can imagine doing with the file system abstraction, while retaining the advantages of a block device, e.g. simplicity and efficiency.

lihuiba avatar Jun 07 '21 03:06 lihuiba

> @xujihui1985 A higher abstraction level doesn't necessarily mean a better solution. For example, Python is a higher-level language than Java or C/C++, but Python is not necessarily better in every aspect. The best(-fit) abstraction varies across scenarios. The block device abstraction doesn't preclude a file system abstraction on top of it. Actually, we have built an internal solution that includes an enhanced file system, called rofs, atop overlaybd. This solution allows everything one can imagine doing with the file system abstraction, while retaining the advantages of a block device, e.g. simplicity and efficiency.

@lihuiba I don't get this metaphor; what's the matter with Python? 😂 And I'm pleased to know you are working on a filesystem solution. Welcome to join forces. :)

xujihui1985 avatar Jun 07 '21 03:06 xujihui1985

> Obviously a filesystem has a higher abstraction level than a block device, which means more business value can be added on top of it, and I don't understand what makes you think this overlaybd thing is the best of both worlds just because it does not depend on FUSE and virtiofs. You depend on TCM, which is another ko, so what is the advantage? You are welcome to identify the pros and cons of the different approaches; instead you keep saying you are the best and the others are meaningless, which makes me feel disgusted.

@xujihui1985 I did some research based on stargz, and really felt the bottleneck of FUSE in both performance and stability. Does FUSE have any alternatives? Or does nydus have some improvements on that (I found no related statement in the nydus docs)?

ghost avatar Jun 07 '21 03:06 ghost

@lovecontainers My team is also trying to improve fuse's performance, and we have an up-coming paper on this topic: https://www.usenix.org/conference/atc21/presentation/hsu .

But there's one more thing to solve: failure recovery. If the FUSE server process crashes or gets killed, the file system instance may not recover.

These problems (performance, fault tolerance, etc.) do not exist in overlaybd.

lihuiba avatar Jun 07 '21 03:06 lihuiba

> @xujihui1985 I did some research based on stargz, and really felt the bottleneck of FUSE in both performance and stability. Does FUSE have any alternatives? Or does nydus have some improvements on that (I found no related statement in the nydus docs)?

In the early stage of developing fs-based image acceleration technologies, FUSE is a good choice. When the technology becomes mature, an in-kernel read-only fs may be a better solution. And nydus aims to become an in-kernel fs :)

jiangliu avatar Jun 07 '21 03:06 jiangliu

> @xujihui1985 I did some research based on stargz, and really felt the bottleneck of FUSE in both performance and stability. Does FUSE have any alternatives? Or does nydus have some improvements on that (I found no related statement in the nydus docs)?

@lovecontainers FUSE itself is not the bottleneck; the problem is how FUSE is used. The advantage of stargz is its compatibility with tar.gz, which is really good. The problems, IMO, are:

  1. Each layer of a stargz image is mounted as a separate FUSE mountpoint, and these layers are then combined with overlayfs.
  2. The TOC must be fully loaded into RSS to index inodes, even for inodes that may never be read, which causes a high memory footprint.

What nydus does to improve on this is to do the "overlay" at the build stage and build the final view of the root fs in metadata, so that there is one FUSE mountpoint per image and the underlying blob files are shared. Instead of loading the entire TOC index into RSS, nydus builds an inode table in the header of the metadata, so only a small portion of memory is needed during startup. You can refer to the detailed design doc here: https://github.com/dragonflyoss/image-service/blob/master/docs/nydus-design.md
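
A minimal sketch of the "overlay at the build stage" idea described above, under simplifying assumptions: layers are plain dicts of path to content reference, a deleted path is marked with a hypothetical `WHITEOUT` value, and directory-level whiteouts are ignored. This is not the nydus builder, only an illustration of flattening layers once so that a single merged view can be mounted.

```python
WHITEOUT = None  # hypothetical marker meaning "this path was deleted in the layer"


def merge_layers(layers):
    """Flatten `layers` (lowest first) into one path -> (layer_index, ref) view.

    Later layers override earlier ones, and a whiteout removes the path from
    the merged view, so no per-layer mount is needed at run time.
    """
    view = {}
    for idx, layer in enumerate(layers):
        for path, ref in layer.items():
            if ref is WHITEOUT:
                view.pop(path, None)
            else:
                view[path] = (idx, ref)
    return view


# Usage: layer 1 replaces /a and deletes /b; layer 2 adds /c.
layers = [
    {"/a": "blob-a0", "/b": "blob-b0"},
    {"/a": "blob-a1", "/b": WHITEOUT},
    {"/c": "blob-c2"},
]
merged_view = merge_layers(layers)
```

Because the merge happens once at build time, the runtime only has to consult the final view, and each entry still records which layer's blob holds the data.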

xujihui1985 avatar Jun 07 '21 06:06 xujihui1985

> @lovecontainers FUSE itself is not the bottleneck; the problem is how FUSE is used. The advantage of stargz is its compatibility with tar.gz, which is really good. The problems, IMO, are:
>
>   1. Each layer of a stargz image is mounted as a separate FUSE mountpoint, and these layers are then combined with overlayfs.
>   2. The TOC must be fully loaded into RSS to index inodes, even for inodes that may never be read, which causes a high memory footprint.
>
> What nydus does to improve on this is to do the "overlay" at the build stage and build the final view of the root fs in metadata, so that there is one FUSE mountpoint per image and the underlying blob files are shared. Instead of loading the entire TOC index into RSS, nydus builds an inode table in the header of the metadata, so only a small portion of memory is needed during startup. You can refer to the detailed design doc here: https://github.com/dragonflyoss/image-service/blob/master/docs/nydus-design.md

yeah, I have already tried something similar to your solutions. thank you.

ghost avatar Jun 07 '21 14:06 ghost

@kofj Where is the git repository of Apparate? I am curious about Apparate's solution for recovering the FUSE process :)

ghost avatar Jun 07 '21 14:06 ghost

An important goal of this proposal is to create a vendor-neutral sub-project in the goharbor community.

kofj avatar Jun 07 '21 15:06 kofj

@lovecontainers Sorry, there is currently no Apparate repository on GitHub. Recovering the FUSE process is a core ability of Apparate. First, the userspace FUSE process and the kernel FUSE module communicate through the /dev/fuse fd, so the FUSE process must be separated from the process that holds the fd. We also need FUSE request tracing in case I/O hangs during recovery. Finally, for a read/write FUSE filesystem, we also need to record the opened fds.
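
The separation of the fd-holding process from the FUSE worker can be sketched with file descriptor passing over a Unix socket. This is only an illustration of the mechanism, not Apparate's implementation: a pipe stands in for `/dev/fuse`, both roles run in one process for brevity, and `hand_over_fd`, `receive_fd` and `demo` are hypothetical names. It assumes Python 3.9+ on a Unix system, where `socket.send_fds` and `socket.recv_fds` are available.

```python
import os
import socket


def hand_over_fd(holder_sock, fd):
    """Holder side: send a duplicate of `fd` to the worker (SCM_RIGHTS)."""
    socket.send_fds(holder_sock, [b"fd"], [fd])


def receive_fd(worker_sock):
    """Worker side: receive one descriptor from the holder."""
    _msg, fds, _flags, _addr = socket.recv_fds(worker_sock, 16, 1)
    return fds[0]


def demo():
    read_end, write_end = os.pipe()  # stand-in for the /dev/fuse descriptor
    holder, worker = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        os.write(write_end, b"request-1")
        hand_over_fd(holder, read_end)
        fd1 = receive_fd(worker)
        first = os.read(fd1, 9)
        os.close(fd1)  # simulate a worker crash: its copy of the fd is gone

        # The holder still owns the original fd, so a restarted worker can
        # obtain it again and keep serving requests on the same channel.
        os.write(write_end, b"request-2")
        hand_over_fd(holder, read_end)
        fd2 = receive_fd(worker)
        second = os.read(fd2, 9)
        os.close(fd2)
        return first, second
    finally:
        os.close(read_end)
        os.close(write_end)
        holder.close()
        worker.close()
```

Request tracing and the record of opened fds mentioned above would sit on top of this hand-off, so that requests in flight at crash time can be identified after the new worker takes over.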

malc0lm avatar Jun 18 '21 06:06 malc0lm

> @lovecontainers Sorry, there is currently no Apparate repository on GitHub. Recovering the FUSE process is a core ability of Apparate. First, the userspace FUSE process and the kernel FUSE module communicate through the /dev/fuse fd, so the FUSE process must be separated from the process that holds the fd. We also need FUSE request tracing in case I/O hangs during recovery. Finally, for a read/write FUSE filesystem, we also need to record the opened fds.

Looking forward to seeing your implementation on GitHub.

ghost avatar Jun 21 '21 01:06 ghost

repo is here: https://github.com/goharbor/acceleration-service

OrlinVasilev avatar Apr 06 '22 13:04 OrlinVasilev

slack channel https://cloud-native.slack.com/archives/C01U31AK2LX

OrlinVasilev avatar Apr 06 '22 13:04 OrlinVasilev