new feature: Checksum
Feature Description
Checksum is extremely important for storage services. We currently support some checksum features in OpenDAL, but they are very limited: we only support setting checksum_algorithm for S3.
I hope we can introduce this feature across all services, supporting newer and faster checksum algorithms such as crc64-nvme or crc64-ecma.
Problem and Solution
- Add `checksum_algorithm` for all supported services.
- Add additional `checksum_algorithm` options for S3, such as `crc64-nvme`.
- We can also add some slow checksums like sha1 and md5.
Additional Context
- Perhaps we should also expose this at the metadata level.
- Should we verify the checksum during the reading process?
- Do we need to introduce a new error type for this?
- We should re-consider the current design.
- Is it a good idea to add at service level?
- Is it better to have a `ChecksumLayer` for this?
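To make the read-side verification question more concrete, here is a minimal sketch of what checking a checksum while reading could look like. Everything in it is an assumption for illustration: `VerifyingReader` and `crc32_update` are hypothetical names, not OpenDAL APIs, CRC-32 stands in for whichever algorithm a service actually provides, and the mismatch is reported as a plain `std::io::Error` with `ErrorKind::InvalidData` rather than a new error type.

```rust
use std::io::{self, Read};

/// Feed `data` into a running CRC-32 (IEEE, reflected) state.
/// The caller starts from 0xFFFF_FFFF and inverts the final state.
fn crc32_update(mut crc: u32, data: &[u8]) -> u32 {
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // All-ones mask when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    crc
}

/// Hypothetical wrapper: checksums bytes as they are read and
/// compares against `expected` once the inner reader hits EOF.
pub struct VerifyingReader<R> {
    inner: R,
    state: u32,
    expected: u32,
    verified: bool,
}

impl<R: Read> VerifyingReader<R> {
    pub fn new(inner: R, expected: u32) -> Self {
        Self { inner, state: 0xFFFF_FFFF, expected, verified: false }
    }
}

impl<R: Read> Read for VerifyingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        if n > 0 {
            self.state = crc32_update(self.state, &buf[..n]);
        } else if !buf.is_empty() && !self.verified {
            // EOF: finalize the CRC and compare exactly once.
            self.verified = true;
            let actual = !self.state;
            if actual != self.expected {
                return Err(io::Error::new(
                    io::ErrorKind::InvalidData,
                    format!(
                        "checksum mismatch: expected {:08x}, got {:08x}",
                        self.expected, actual
                    ),
                ));
            }
        }
        Ok(n)
    }
}
```

One limitation of this shape is that a mismatch only surfaces at EOF, after the caller has already consumed the data, and range reads cannot be verified against a full-object checksum at all. Both points probably matter for deciding between a service-level option and a layer.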
Are you willing to contribute to the development of this feature?
- [ ] Yes, I am willing to contribute to the development of this feature.
I'm interested in exploring the additional context mentioned here. Can I join you, @geetanshjuneja and maybe we can enrich the implementation involved here together? 😊
Hi, @zjregee, welcome! We definitely need more research in this area. End-to-end checksum (both reading and writing) is complex in OpenDAL's position, as we want to leverage the full power of the service while also providing a great API for users.
Yes, I think checksums will be a necessary and important guarantee, and it is worth doing some preliminary research on this.
The problem with ChecksumLayer is that services have different checksum support, making it difficult to unify them.
To make things more complex, S3 handles checksums in two different ways: full-object checksums and part-object checksums.
Full-object means passing the checksum of the whole object when calling `Writer::close()`; part-object means using the checksums of the individual parts to build a single combined checksum.
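A rough sketch of the difference between the two modes, using only the standard library. The function names are hypothetical, CRC-32 is only a stand-in algorithm, and the composite form assumes the "checksum of the concatenated per-part checksums" scheme as I understand S3's multipart behavior. The real wire encoding (base64 plus a part-count suffix) is left out.

```rust
/// Feed `data` into a running CRC-32 (IEEE, reflected) state.
fn crc32_update(mut crc: u32, data: &[u8]) -> u32 {
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // All-ones mask when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    crc
}

/// One-shot CRC-32 of a byte slice.
fn crc32(data: &[u8]) -> u32 {
    !crc32_update(0xFFFF_FFFF, data)
}

/// Full-object: a single checksum over all bytes, independent of
/// how the object was split into parts.
fn full_object_checksum(parts: &[&[u8]]) -> u32 {
    let state = parts
        .iter()
        .fold(0xFFFF_FFFF, |crc, part| crc32_update(crc, part));
    !state
}

/// Part-object (composite): checksum each part, concatenate the raw
/// big-endian digests, and checksum that. Returns the digest and the
/// part count, since the result depends on the part boundaries.
fn composite_checksum(parts: &[&[u8]]) -> (u32, usize) {
    let mut digests = Vec::with_capacity(parts.len() * 4);
    for part in parts {
        digests.extend_from_slice(&crc32(part).to_be_bytes());
    }
    (crc32(&digests), parts.len())
}
```

The full-object value is the same no matter where the part boundaries fall, so it can be compared with a checksum computed locally over the whole file. The composite value changes with the part layout, which is part of what makes a uniform abstraction across services hard.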
Hi, @Xuanwo.
Sorry, I didn't notice that the discussion had moved here, so I posted my initial thoughts at https://github.com/apache/opendal/issues/5635#issuecomment-2664905328.
Yes, these are all features we need to take into account. To arrive at a universal solution, I think we should look at more storage services to see which of these features are common across them and which are difficult to generalize.
As I mentioned in https://github.com/apache/opendal/issues/5635#issuecomment-2664905328, I would be happy to first research and provide a list of checksum support and usage across different services.
If this is still open, I can pick it up. @Xuanwo
Hi, thank you @uruemu for your interest.
While this issue is indeed open, it is still in the very early stages of research, and there is no coding work available to take on at this time.
@Xuanwo makes sense.
Is anyone actively doing this research? Or is it open to pick up even if it's just in discovery? I might be able to put some bandwidth into that
Welcome to join in the research. Your input is always appreciated.