research: OCI artifacts

Open VoVAllen opened this issue 2 years ago • 3 comments

Description

Use the OCI artifact standard to store artifacts and models when developing ML models.


Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.

VoVAllen avatar Nov 17 '22 08:11 VoVAllen

Summary

ormb is a tool used to package a model into an OCI artifact. However, it becomes very slow once the model size reaches 20GB or more.

In this research, we investigate the drawbacks of ormb, why it is so slow for large models, and how to build better OCI storage.

Review of S3

The S3 upload endpoint uses a Content-MD5 header to validate files.

Content-MD5

The base64-encoded 128-bit MD5 digest of the message (without the headers) according to RFC 1864. This header can be used as a message integrity check to verify that the data is the same data that was originally sent. Although it is optional, we recommend using the Content-MD5 mechanism as an end-to-end integrity check. For more information about REST request authentication, see REST Authentication.

Upload with Content-MD5:

graph LR
1[calculate md5 local] --> 2[upload file and md5]
2[upload file and md5] --> 3[calculate md5 remote]
3[calculate md5 remote] --> 4[validate]

Upload without Content-MD5:

graph LR
1[upload file without md5] --> 2[not validate]
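
The validated upload flow can be sketched in a few lines. This is a minimal illustration that assumes the whole body fits in memory; `content_md5` and `validate_upload` are hypothetical helper names, not part of any S3 SDK.

```python
import base64
import hashlib


def content_md5(body: bytes) -> str:
    """Return the base64-encoded 128-bit MD5 digest of the body (RFC 1864)."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")


def validate_upload(body: bytes, claimed_md5: str) -> bool:
    """Hypothetical receiver-side check: recompute the digest and compare."""
    return content_md5(body) == claimed_md5


# Sender computes the digest locally and sends it along with the body.
body = b"model weights"
header_value = content_md5(body)

# Receiver recomputes the digest over what it actually received.
print(validate_upload(body, header_value))
```

Skipping the header corresponds to the second graph: nothing is recomputed, so a corrupted body would go unnoticed.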

Improve ormb tool

1 - no compression

ormb gzip-compresses the model, which consumes 70%-75% of the time of the whole procedure.

mediaType string

This descriptor property has additional restrictions for layers[]. Implementations MUST support at least the following media types:

Manifests concerned with portability SHOULD use one of the above media types. An encountered mediaType that is unknown to the implementation MUST be ignored.

Entries in this field will frequently use the +gzip types.

The OCI media types support uncompressed layers (without gzip), so this compression could be removed entirely. Whether we could also get rid of tar is not clear.
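
As a rough sketch of what an uncompressed layer looks like, the snippet below packs files into a plain tar and builds a descriptor for it. It assumes the uncompressed layer media type `application/vnd.oci.image.layer.v1.tar`; `build_uncompressed_layer` is a hypothetical helper and does not reflect how ormb is implemented.

```python
import hashlib
import io
import tarfile
from typing import Dict, Tuple

# Assumed media type for a tar layer without gzip.
TAR_MEDIA_TYPE = "application/vnd.oci.image.layer.v1.tar"


def build_uncompressed_layer(files: Dict[str, bytes]) -> Tuple[bytes, dict]:
    """Pack files into a plain tar (no gzip) and describe the resulting blob."""
    buf = io.BytesIO()
    # Mode "w" writes an uncompressed tar; "w:gz" would add the gzip step.
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    blob = buf.getvalue()
    descriptor = {
        "mediaType": TAR_MEDIA_TYPE,
        "digest": "sha256:" + hashlib.sha256(blob).hexdigest(),
        "size": len(blob),
    }
    return blob, descriptor


blob, descriptor = build_uncompressed_layer({"model.bin": b"\x00" * 1024})
print(descriptor["mediaType"], descriptor["size"])
```

The tar step is kept here because, as noted above, it is not clear whether it can be dropped.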

2 - hash once

The sha256 calculation costs 20% of the time of the whole procedure.

graph LR
1[calculate sha256 local at ormb] --> 2[calling oras with sha256]
2[calling oras with sha256] --> 3[calculate sha256 local at oras]
3[calculate sha256 local at oras] --> 4[validate]
4[validate] --> 5[Commit]

The sha256 hash is calculated twice locally, once by ormb and once by oras. As the local copy is almost error-free, this might be unnecessary.

oras can be called without the Digest argument, in which case the procedure becomes:

graph LR
1[calling oras without sha256] --> 2[calculate sha256 local at oras]
2[calculate sha256 local at oras] --> 3[Commit]

Thus, the time cost of sha256 can be halved.
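
The two flows can be compared with a small model that counts how many bytes pass through sha256. This is only a sketch: `HashCounter`, `push_with_digest` and `push_without_digest` are hypothetical names standing in for the ormb and oras steps, and bytes hashed is used as a proxy for time.

```python
import hashlib
import io


class HashCounter:
    """Computes sha256 digests and tracks how many bytes were hashed in total."""

    def __init__(self) -> None:
        self.bytes_hashed = 0

    def digest(self, stream, chunk_size: int = 1 << 20) -> str:
        h = hashlib.sha256()
        stream.seek(0)
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
            self.bytes_hashed += len(chunk)
        return "sha256:" + h.hexdigest()


def push_with_digest(stream, counter: HashCounter) -> str:
    """Caller hashes first, then the push step hashes again and validates."""
    expected = counter.digest(stream)
    actual = counter.digest(stream)
    if expected != actual:
        raise ValueError("digest mismatch")
    return actual


def push_without_digest(stream, counter: HashCounter) -> str:
    """Only the push step hashes; there is nothing to validate against."""
    return counter.digest(stream)


blob = io.BytesIO(b"x" * 1000)
twice, once = HashCounter(), HashCounter()
push_with_digest(blob, twice)
push_without_digest(blob, once)
print(twice.bytes_hashed, once.bytes_hashed)
```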

3 - new hash algorithm

According to Speed Hashing, MD5 is much faster than SHA-256:

MD5 23070.7 M/s
SHA-1 7973.8 M/s
SHA-256 3110.2 M/s
SHA-512 267.1 M/s
NTLM 44035.3 M/s
DES 185.1 M/s
WPA/WPA2 348.0 k/s
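
To get a feel for the gap on a given machine, a rough single-threaded measurement with Python's `hashlib` might look like the sketch below. `throughput_mb_per_s` is a hypothetical helper, and the results depend heavily on the CPU and on the hashlib build, so they are not comparable to the figures listed above.

```python
import hashlib
import time


def throughput_mb_per_s(algorithm: str, size_mb: int = 8) -> float:
    """Hash size_mb MiB of zero bytes and return the observed MiB per second."""
    chunk = b"\0" * (1 << 20)  # 1 MiB buffer, reused for every update
    h = hashlib.new(algorithm)
    start = time.perf_counter()
    for _ in range(size_mb):
        h.update(chunk)
    h.digest()
    elapsed = time.perf_counter() - start
    # Guard against a zero reading from a coarse clock.
    return size_mb / max(elapsed, 1e-9)


for name in ("md5", "sha256"):
    print(name, round(throughput_mb_per_s(name), 1), "MiB/s")
```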

The OCI image-spec states that an image may use an unregistered digest algorithm, and that a digest with an unrecognized algorithm should pass validation. However, the open source registries django-oci and distribution/distribution (the core library for many registry operators, including Docker Hub, GitHub Container Registry, GitLab Container Registry and DigitalOcean Container Registry) reject any unsupported algorithm. For this reason, we cannot pick a faster algorithm such as xxHash.

Although the opencontainers group has proposed a new hash algorithm, blake3, to speed up hashing on multi-CPU machines, it is still considered an alternate algorithm and is not yet supported by the above registries. Related issue: https://github.com/opencontainers/go-digest/pull/66
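
The difference between the lenient behaviour described in the spec and the strict behaviour observed in registries can be illustrated as follows. This is a sketch only: the regular expression is my reading of the digest grammar in the image-spec, the `REGISTERED` table assumes sha256 and sha512 are the registered algorithms, and `validate_lenient` / `validate_strict` are hypothetical names that do not mirror the code of any registry.

```python
import re

# Assumed digest grammar: algorithm ":" encoded
DIGEST_RE = re.compile(r"^([a-z0-9]+(?:[+._-][a-z0-9]+)*):([a-zA-Z0-9=_-]+)$")

# Assumed registered algorithms, mapped to their lowercase hex length.
REGISTERED = {"sha256": 64, "sha512": 128}


def parse_digest(digest: str):
    """Return (algorithm, encoded), or None if the string breaks the grammar."""
    match = DIGEST_RE.match(digest)
    return match.groups() if match else None


def _registered_ok(algorithm: str, encoded: str) -> bool:
    length = REGISTERED[algorithm]
    return re.fullmatch(r"[a-f0-9]{%d}" % length, encoded) is not None


def validate_lenient(digest: str) -> bool:
    """Spec-style check: unrecognized algorithms pass if the grammar holds."""
    parsed = parse_digest(digest)
    if parsed is None:
        return False
    algorithm, encoded = parsed
    if algorithm in REGISTERED:
        return _registered_ok(algorithm, encoded)
    return True


def validate_strict(digest: str) -> bool:
    """Registry-style check: anything outside the known set is rejected."""
    parsed = parse_digest(digest)
    if parsed is None or parsed[0] not in REGISTERED:
        return False
    return _registered_ok(*parsed)


xx = "xxh64:0123456789abcdef"
print(validate_lenient(xx), validate_strict(xx))
```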

Conclusion

From the above discussion, we conclude that most of the time spent on an OCI upload goes to calculating sha256, while S3 uses Content-MD5 to validate uploaded files. Moreover, MD5 is optional in S3, so users can trade off speed against correctness for each upload.

Though sha256 is much slower than md5, we cannot replace it with a new algorithm like xxHash, because registries do not fully support the OCI spec. The official alternative, blake3, is not yet supported by them either.

OCI uploads cannot be accelerated to S3-level speed until the opencontainers organization makes progress on this.

cutecutecat avatar Feb 09 '23 02:02 cutecutecat

It requires a cryptographic hash algorithm, so you cannot use something like md5 or xxHash. I guess we need to wait for blake3.

kemingy avatar Feb 09 '23 03:02 kemingy

For the last two months, I have been developing an LLM. If you have any questions about LLMs over 200GB, I am willing to give you feedback.

aseaday avatar Feb 09 '23 03:02 aseaday