accelerated-container-image

Consultation: DADI's business use cases & implementation status

Open • bengbeng-pp opened this issue 2 years ago • 10 comments

Excuse me, I have a few questions. Does DADI have a user discussion group? Are there any business scenarios that DADI is not suitable for? Could you share the current status of DADI's adoption at Alibaba?

bengbeng-pp, Jun 10 '22 02:06

Does DADI have a user discussion group?

Yes, we created a channel named 'overlaybd' in the CNCF Slack workspace today 😂. Log in to the workspace through https://slack.cncf.io/ and search for 'overlaybd'.

Are there any business scenarios that DADI is not suitable for?

In theory, overlaybd is more compatible than other image formats because it supports any filesystem. We strongly suggest using overlaybd in production environments on CentOS, Ubuntu, or Aliyun Linux, where it is very easy to support.

Could you share the current status of DADI's adoption at Alibaba?

DADI has been widely used in Alibaba Group for about 4 years, and it now supports almost all container-related products. Some public information is available here: https://help.aliyun.com/document_detail/184556.html https://developer.aliyun.com/article/782918?utm_content=g_1000255233 https://www.sohu.com/a/504934204_612370

BigVan, Jun 10 '22 09:06

Thanks for answering. Let me explain the second and third questions. In our actual rollout, although image pull time has improved, container startup and business (application) startup now take longer. I understand that this is inevitable. I would like to ask about your practical production experience to help us roll out DADI on a larger scale.

  1. How much does on-demand loading affect container startup and business startup, and is there any data on this?
  2. Which factors determine the impact of on-demand loading on container startup and business startup? For example, business scenario, development language, etc. Based on this data, we hope to decide which scenarios to prioritize for a large-scale DADI rollout.
  3. In addition, besides trace prefetching, are there other optimization measures or plans for this situation?

bengbeng-pp, Jun 14 '22 08:06

I see. It seems you are concerned that on-demand loading will negatively affect business startup time. The difference in business startup time between a local image and overlaybd is usually not significant; it depends on your registry latency and the I/O pattern of your application. In most cases, an application loads only a small amount of image data during its life cycle. BTW, overlaybd keeps the data it loaded on demand in its cache directory, so you won't notice any problem on the next startup.
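To make the caching behavior concrete, here is a minimal Python sketch. It is a toy model and not overlaybd's implementation: `LazyImage`, the in-memory `remote` dict standing in for the registry, and the block IDs are all hypothetical.

```python
class LazyImage:
    """Toy model of on-demand loading: a block is fetched on first read, then cached."""

    def __init__(self, remote_blocks, cache=None):
        self.remote_blocks = remote_blocks             # stands in for the registry (assumption)
        self.cache = {} if cache is None else cache    # stands in for the local cache directory
        self.remote_fetches = 0

    def read(self, block_id):
        if block_id not in self.cache:
            # Cache miss: pay the registry latency once for this block.
            self.cache[block_id] = self.remote_blocks[block_id]
            self.remote_fetches += 1
        return self.cache[block_id]


remote = {i: bytes([i % 256]) * 4 for i in range(100)}   # a 100-block "image"
startup_blocks = [0, 1, 2, 7, 7, 42]                      # blocks the app touches at startup

# First (cold) startup: only the blocks that are actually read get fetched.
cold = LazyImage(remote)
for block_id in startup_blocks:
    cold.read(block_id)
cold_fetches = cold.remote_fetches
fraction_loaded = len(cold.cache) / len(remote)

# Second startup on the same host: the cache is reused, so nothing is fetched remotely.
warm = LazyImage(remote, cache=cold.cache)
for block_id in startup_blocks:
    warm.read(block_id)
warm_fetches = warm.remote_fetches
```

In this model the cold start pays one remote fetch per distinct block, so the slowdown scales with registry latency and with how many distinct blocks the application reads, which is the dependency described above.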

Our paper describes some performance behavior in our production environment. It also includes MySQL benchmarks, which may help you:

[Figure: cold startup time, tgz image vs. overlaybd]

One more thing: DADI is widely used for machine learning and WebIDE workloads in Alibaba Cloud, but I think you should run your own tests in your environment.

For question 3: currently, we don't have other optimization plans.

You can contact me by email ([email protected]) if you need more information or help adopting DADI in your company :D

BigVan, Jun 14 '22 14:06

@bengbeng-pp Currently in Alibaba Cloud, only Function Compute uses trace prefetching, because it is relatively easy for them to record traces. Some businesses are reluctant to do that.

I think what you need is Cache + P2P distribution. DADI has an open-source implementation of each. By setting up a large-scale SSD cluster, you basically distribute and cache every hot piece of data across the network, and thus a mighty network filesystem is formed :-)
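A minimal Python sketch of the Cache + P2P idea, assuming a simple lookup order of local cache, then peers, then the registry. The function `fetch_block`, the dict-based caches, and the `stats` counters are hypothetical and are not DADI's actual API.

```python
def fetch_block(block_id, local_cache, peers, registry, stats):
    """Return a block, preferring the local cache, then peer caches, then the registry."""
    if block_id in local_cache:
        stats["local"] += 1
        return local_cache[block_id]
    for peer_cache in peers:
        if block_id in peer_cache:
            stats["peer"] += 1
            data = peer_cache[block_id]
            break
    else:
        # No peer holds the block: fall back to the registry.
        stats["registry"] += 1
        data = registry[block_id]
    local_cache[block_id] = data   # keep it so the next read is local
    return data


registry = {i: f"block-{i}".encode() for i in range(10)}
peers = [{1: registry[1], 2: registry[2]}, {3: registry[3]}]   # caches on other hosts
local_cache = {}
stats = {"local": 0, "peer": 0, "registry": 0}

for block_id in [1, 3, 5, 1]:
    fetch_block(block_id, local_cache, peers, registry, stats)
```

With hot data spread across peer caches, only blocks that no host has seen yet reach the registry.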

beef9999, Jun 14 '22 15:06

Thank you very much for your patience.

  1. Does the cold start in the figure above mean that the application has started successfully? Was the data in the figure collected without a local cache?
  2. "The difference in business startup time between a local image and overlaybd is usually not significant; it depends on your registry latency and the I/O pattern of your application." What does the I/O pattern of the application refer to?
  3. Is DADI mainly used for machine learning and WebIDE? Why these two scenarios, and has it been deployed for online business?
  4. Why does only Function Compute use trace prefetching? Trace prefetching may not be applicable to our scenario, because recording traces may cause business logic to run and modify business data.

Other:

  1. We are now using the cache feature. Can you suggest ways to improve the observability of the cache? For example, the percentage of an image that is cached locally and the cache hit rate.
  2. We haven't used the P2P feature yet, but it is planned. Do you have any performance data for P2P? Also, is it possible to monitor the download speed from the remote?

bengbeng-pp, Jun 16 '22 07:06

  1. Does the cold start in the figure above mean that the application has started successfully? Was the data in the figure collected without a local cache?

Yes, the figure compares the cold startup time of a tgz image with that of overlaybd.

  2. "The difference in business startup time between a local image and overlaybd is usually not significant; it depends on your registry latency and the I/O pattern of your application." What does the I/O pattern of the application refer to?

By 'the I/O pattern' I mean that most applications use only a small part of the image data (~6.4%, FAST '16).

  3. Is DADI mainly used for machine learning and WebIDE? Why these two scenarios, and has it been deployed for online business?

Actually, online business was the first scenario we deployed. The machine learning and WebIDE workloads I mentioned always use larger images than others (~10 GB+).

  4. Why does only Function Compute use trace prefetching? Trace prefetching may not be applicable to our scenario, because recording traces may cause business logic to run and modify business data.

Overlaybd records the image I/O trace without network access. In my experience, trace prefetching should be helpful to you.

About cache usage:

We use LRU to automatically evict unused cache data, so the cache never exceeds its capacity limit. If you want to know the cache's disk usage, run 'du -sh' on the cache directory.
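For the observability question, the following Python sketch shows one way to reason about LRU eviction under a capacity limit and about a cache hit rate. It is a toy model under stated assumptions, not overlaybd's cache code: `LRUCache`, its counters, and `dir_size_bytes` are hypothetical names. `dir_size_bytes` sums apparent file sizes, whereas `du` reports allocated disk blocks, so the two numbers can differ.

```python
import os
from collections import OrderedDict


class LRUCache:
    """Byte-bounded LRU cache that also counts hits and misses."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.entries = OrderedDict()   # oldest (least recently used) entry first
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)   # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        return None

    def put(self, key, data):
        if key in self.entries:
            self.used_bytes -= len(self.entries.pop(key))
        self.entries[key] = data
        self.used_bytes += len(data)
        # Evict least recently used entries until usage is within the limit.
        while self.used_bytes > self.capacity_bytes:
            _, evicted = self.entries.popitem(last=False)
            self.used_bytes -= len(evicted)

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def dir_size_bytes(path):
    """Sum the apparent sizes of regular files under path, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


cache = LRUCache(capacity_bytes=8)
cache.put("a", b"1234")
cache.put("b", b"1234")
cache.get("a")            # hit; "a" becomes the most recently used entry
cache.put("c", b"1234")   # over capacity, so "b" (least recently used) is evicted
cache.get("b")            # miss
```

Dividing the result of `dir_size_bytes` on the cache directory by the image size would give a rough local-cache percentage, assuming the directory holds data for that image only.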

About P2P:

There is a very rudimentary open-source implementation of our P2P, but I don't think you need it: https://github.com/data-accelerator/dadi-p2proxy

Anyway, as I said before, I can only give you conclusions from my own experience; you should run your own tests. :-D

BigVan, Jun 16 '22 08:06

Overlaybd records the image I/O trace without network access.

Will this cause the application to get stuck when it needs network operations, making it impossible to obtain a complete record of its I/O operations?

bengbeng-pp, Jun 17 '22 07:06

Yes. The prefetch trace depends on the application environment. When it is recorded without network access, it can only accelerate the initial phase of container startup.
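The limitation can be illustrated with a small Python sketch of record-and-replay prefetching. It is only a model under stated assumptions: `TraceRecorder`, `prefetch`, and the block IDs are hypothetical, and the real overlaybd trace format is not shown here.

```python
class TraceRecorder:
    """Records the order in which image blocks are first read during a recording run."""

    def __init__(self):
        self.trace = []
        self._seen = set()

    def on_read(self, block_id):
        if block_id not in self._seen:
            self._seen.add(block_id)
            self.trace.append(block_id)


def prefetch(trace, fetch, cache):
    """Replay a recorded trace, pulling each block into the cache ahead of the application."""
    for block_id in trace:
        if block_id not in cache:
            cache[block_id] = fetch(block_id)


# Recording run: without network, the app gets no further than its early startup reads.
recorder = TraceRecorder()
for block_id in [0, 1, 1, 4]:
    recorder.on_read(block_id)

# Later cold start: replay the trace first, then let the app run its full startup.
registry = {i: f"block-{i}".encode() for i in range(10)}
cache = {}
prefetch(recorder.trace, registry.__getitem__, cache)

full_startup_reads = [0, 1, 4, 8, 9]
on_demand = [b for b in full_startup_reads if b not in cache]   # still loaded lazily
```

Blocks read after the point where the recording run stalled are missing from the trace, so they are still loaded on demand.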

BigVan, Jun 20 '22 09:06

I understand, thank you very much for the answer

bengbeng-pp, Jun 21 '22 06:06

@bengbeng-pp Currently in Alibaba Cloud, only Function Compute uses trace prefetching, because it is relatively easy for them to record traces. Some businesses are reluctant to do that.

I think what you need is Cache + P2P distribution. DADI has an open-source implementation of each. By setting up a large-scale SSD cluster, you basically distribute and cache every hot piece of data across the network, and thus a mighty network filesystem is formed :-)

Hello, is there any documentation on how to configure the cache and P2P? When I pull an obd-format image from the registry, I cannot see anything in /opt/overlaybd/registry_cache.

dbfancier, Oct 26 '22 06:10