new feature: Checksum

Open Xuanwo opened this issue 11 months ago • 9 comments

Feature Description

Checksum is extremely important for storage services. We currently support some checksum features in OpenDAL, but they are very limited: we only support setting checksum_algorithm for S3.

I hope we can introduce this feature across all services, supporting newer and faster checksum algorithms such as crc64-nvme or crc64-ecma.

Problem and Solution

  • Add checksum_algorithm for all supported services.
  • Add additional checksum_algorithm options for S3, such as crc64-nvme.
  • We can also add some slower checksums, such as sha1 and md5.

Additional Context

  • Perhaps we should also expose this at the metadata level.
  • Should we verify the checksum during the reading process?
  • Do we need to introduce a new error type for this?
  • We should re-consider the current design.
    • Is it a good idea to add at service level?
    • Is it better to have a ChecksumLayer for this?
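To make the read-verification question concrete, here is a minimal sketch of a reader that checks a checksum once the stream ends. It is not OpenDAL's API: `VerifyingReader` and `crc32_update` are hypothetical names, CRC-32 stands in for whichever algorithm gets chosen, and `io::ErrorKind::InvalidData` stands in for the possible new error type.

```rust
use std::io::{self, Read};

/// Bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320).
/// Start with `state = 0` and feed chunks in order.
fn crc32_update(state: u32, data: &[u8]) -> u32 {
    let mut crc = !state;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // All-ones mask when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Hypothetical wrapper: passes bytes through, tracks a running checksum,
/// and reports a mismatch when the inner reader reaches end of stream.
struct VerifyingReader<R> {
    inner: R,
    state: u32,
    expected: u32,
}

impl<R: Read> VerifyingReader<R> {
    fn new(inner: R, expected: u32) -> Self {
        Self { inner, state: 0, expected }
    }
}

impl<R: Read> Read for VerifyingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // A zero-length buffer says nothing about end of stream.
        if buf.is_empty() {
            return Ok(0);
        }
        let n = self.inner.read(buf)?;
        if n == 0 {
            // End of stream: compare the running checksum with the expected one.
            if self.state != self.expected {
                return Err(io::Error::new(
                    io::ErrorKind::InvalidData,
                    format!(
                        "checksum mismatch: expected {:08x}, got {:08x}",
                        self.expected, self.state
                    ),
                ));
            }
            return Ok(0);
        }
        self.state = crc32_update(self.state, &buf[..n]);
        Ok(n)
    }
}
```

One consequence of this shape is that a mismatch only surfaces after the caller has already consumed all the data, which is part of why the error-type and layering questions matter.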

Are you willing to contribute to the development of this feature?

  • [ ] Yes, I am willing to contribute to the development of this feature.

Xuanwo avatar Jan 15 '25 08:01 Xuanwo

I'm interested in exploring the additional context mentioned here. Can I join you, @geetanshjuneja? Maybe we can flesh out the implementation together. 😊

zjregee avatar Feb 07 '25 09:02 zjregee

Hi, @zjregee, welcome! We definitely need more research in this area. End-to-end checksum (both reading and writing) is complex in OpenDAL's position, as we want to leverage the full power of the service while also providing a great API for users.

Xuanwo avatar Feb 07 '25 10:02 Xuanwo

Yes, I think checksums will be a necessary and important guarantee, and it is worth doing some preliminary research on this.

zjregee avatar Feb 07 '25 10:02 zjregee

The problem with ChecksumLayer is that services have different checksum support, making it difficult to unify them.

To make things more complex, S3 handles checksums in two different ways: full-object checksums and part-object checksums.

Full-object means passing the checksum of the whole object when calling Writer::close(). Part-object means computing a checksum for each part and building the final checksum from those part checksums.
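A rough sketch of the difference, assuming CRC-32 as the algorithm. The function names are hypothetical, and the composite form (a checksum over the concatenated part checksums, paired with the part count) follows my understanding of how S3 describes multipart checksums, so treat it as an illustration rather than the exact wire format.

```rust
/// Bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320).
/// Start with `state = 0` and feed chunks in order.
fn crc32_update(state: u32, data: &[u8]) -> u32 {
    let mut crc = !state;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // All-ones mask when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Full-object: one running checksum across every part, in order.
/// The result equals the checksum of the concatenated bytes.
fn full_object_checksum(parts: &[&[u8]]) -> u32 {
    parts.iter().fold(0, |state, part| crc32_update(state, part))
}

/// Part-object (composite): checksum each part on its own, then checksum
/// the concatenation of those digests. Returns the value and the part count.
fn composite_checksum(parts: &[&[u8]]) -> (u32, usize) {
    let mut digests = Vec::with_capacity(parts.len() * 4);
    for part in parts {
        digests.extend_from_slice(&crc32_update(0, part).to_be_bytes());
    }
    (crc32_update(0, &digests), parts.len())
}
```

The full-object value does not depend on where the part boundaries fall, while the composite value does. That is one reason the two modes are hard to hide behind a single API.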

Xuanwo avatar Feb 18 '25 08:02 Xuanwo

Hi, @Xuanwo.

Sorry, I didn't notice that the discussion had moved here, so I posted my original thoughts at https://github.com/apache/opendal/issues/5635#issuecomment-2664905328.

Yes, these are all features we need to take into account. To reach a universal solution, I think we should look at more storage services to see which of these features are common and which are difficult to generalize.

As I mentioned in https://github.com/apache/opendal/issues/5635#issuecomment-2664905328, I would be happy to start by researching and listing how different services support or use checksums.

zjregee avatar Feb 18 '25 09:02 zjregee

If this is still open, I can pick it up. @Xuanwo

uruemu avatar Apr 23 '25 01:04 uruemu

> If this is still open, I can pick it up. @Xuanwo

Hi, thank you @uruemu for your interest.

While this issue is indeed open, it is still in the very early stages of research, and there is no coding work available to take on at this time.

Xuanwo avatar Apr 23 '25 03:04 Xuanwo

@Xuanwo makes sense.

Is anyone actively doing this research? Or is it open to pick up even if it's just in discovery? I might be able to put some bandwidth into that.

uruemu avatar Apr 23 '25 23:04 uruemu

> Is anyone actively doing this research? Or is it open to pick up even if it's just in discovery? I might be able to put some bandwidth into that

You're welcome to join the research. Your input is always appreciated.

Xuanwo avatar Apr 24 '25 01:04 Xuanwo