containerized-data-importer
Add expected_hash to DataVolume
/kind enhancement
qemu images may be garbled en route from an http server to CDI's target. To identify this condition, many servers (e.g. Fedora) provide the expected checksum of the image.
I would like to state the expected hash in the DataVolume spec, and have the import fail if any of the provided hashes does not match the imported data.
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: "example-import-dv"
spec:
  source:
    http:
      url: "http://mirror.isoc.org.il/pub/fedora/releases/33/Cloud/x86_64/images/Fedora-Cloud-Base-33-1.2.x86_64.raw.xz"
      hashes:
        sha256: 35fa778f5d4830b58f7baf121fff6bd2b52500411c9abc46761b29a690415c3f
        length: 203308980
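For illustration only, the Go types behind such a spec could look roughly like this. None of these names exist in the actual CDI API; they are just a sketch of the proposal.

```go
// Hypothetical sketch only: these types and field names are illustrative
// and are not part of the real CDI API.
package example

// ImageHashes lists the digests the importer would verify against the
// imported data.
type ImageHashes struct {
	// SHA256 is the expected hex-encoded SHA-256 digest of the image.
	SHA256 string `json:"sha256,omitempty"`
	// Length is the expected size of the image in bytes.
	Length *int64 `json:"length,omitempty"`
}

// HTTPSourceWithHashes shows how the hashes could hang off the http source.
type HTTPSourceWithHashes struct {
	URL    string       `json:"url"`
	Hashes *ImageHashes `json:"hashes,omitempty"`
}
```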
This stackoverflow answer may provide an elegant way to integrate hashing into our stream readers.
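As a rough sketch of that pattern (not CDI's actual importer code), the idea is a reader wrapper that hashes everything read through it, so the digest can be checked once the stream has been fully consumed:

```go
// Sketch only: a reader that computes a SHA-256 digest of every byte that
// passes through it. Not taken from the CDI code base.
package example

import (
	"crypto/sha256"
	"encoding/hex"
	"hash"
	"io"
)

// hashingReader wraps another reader and hashes whatever is read from it.
type hashingReader struct {
	r io.Reader
	h hash.Hash
}

func newHashingReader(r io.Reader) *hashingReader {
	return &hashingReader{r: r, h: sha256.New()}
}

func (hr *hashingReader) Read(p []byte) (int, error) {
	n, err := hr.r.Read(p)
	if n > 0 {
		// hash.Hash.Write never returns an error.
		hr.h.Write(p[:n])
	}
	return n, err
}

// Sum returns the hex digest of everything read so far; the importer would
// compare it to the expected value after the stream is exhausted.
func (hr *hashingReader) Sum() string {
	return hex.EncodeToString(hr.h.Sum(nil))
}
```

io.TeeReader with a hash.Hash as the writer achieves the same effect with less code.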
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
/lifecycle stale
/remove-lifecycle stale
@mhenriks would you explain here why this is nontrivial to implement?
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle rotten
/lifecycle frozen
I want this too :)
I think this is essential. For containerDisks this is kind of built-in (at least I hope that skopeo does these checks).
Can confirm that this pattern works great. We use it for instance here: https://github.com/kubevirt/containerdisks/blob/main/pkg/http/http.go#L38
@rmohr @dankenigsberg
IMO it is not secure to compute the checksum at t0 and assume it is the same at t1. Especially with http (no s) url.
The only truly secure way to do this is to download the file to scratch space, compute the checksum (which can be done while downloading), and then use the downloaded file. This is definitely something we can do, but it will make certain operations slower (http(s) qcow2 or raw).
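A minimal sketch of that flow, with made-up function and path names (download to scratch space and hash in a single pass, and only hand the file on if the digest matches):

```go
// Sketch of the scratch-space flow described above: the image is written to
// a scratch file while its digest is computed, and the file is only used
// afterwards if the digest matches. Names and paths are illustrative.
package example

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
	"os"
)

// downloadToScratch stores the remote image at scratchPath and returns an
// error if the computed SHA-256 does not match expected.
func downloadToScratch(url, expected, scratchPath string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	f, err := os.Create(scratchPath)
	if err != nil {
		return err
	}
	defer f.Close()

	// Write to the scratch file and the hasher in one pass.
	h := sha256.New()
	if _, err := io.Copy(io.MultiWriter(f, h), resp.Body); err != nil {
		return err
	}

	if got := hex.EncodeToString(h.Sum(nil)); got != expected {
		return fmt.Errorf("checksum mismatch: got %s, want %s", got, expected)
	}
	// Only now would the converter (qemu-img/nbdkit) be pointed at scratchPath.
	return nil
}
```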
Could you explain what you mean? What is the difference between computing the checksum during the download (stream) and first downloading and then computing it? I am not suggesting to first compute the checksum from the remote location and then download the file.
Edit: Oh, you are probably talking about directly pushing and releasing the data while it streams through. Yes, right. If the target has no toggle which can be triggered after the import to make it usable, then yes, definitely.
Yeah, for example we can currently directly stream a qcow2 file to raw via http. This requires no scratch space. To validate the qcow2 checksum will require downloading the qcow2 file to scratch space.
I am not sure I understand this example. Are you talking about cases where you internally use a tool which requires an http source and does the download itself? Otherwise it seems to me like you could always calculate the checksum while you convert with a tee reader, or not?
You could still fail the import after the full download where you finally know that the checksum is bad.
nbdkit is used to give CDI one interface to a bunch of files/formats on the other end of a URL. My assumption is that when a checksum is provided, CDI will first download/validate the file and then point nbdkit to the local file rather than the http URL.
In this case I would probably aim for providing sockets or file descriptors (e.g. pipes) to nbdkit (unless nbdkit supports checksums directly).
If nbdkit does not access the file sequentially I don't think that we will be able to efficiently compute checksums. And I'm pretty sure it does not access qcow sequentially.
If this is needed, definitely :)
Will a Merkle hash be better than a linear hash? See for example https://listman.redhat.com/archives/libguestfs/2023-February/030908.html, which describes how blkhash can produce hash results faster by using a Merkle tree.
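For illustration, a block-wise (Merkle-style) digest could look roughly like the sketch below: hash fixed-size blocks independently, then hash the list of per-block digests. Per-block hashing can be parallelized, and unchanged or all-zero blocks can be skipped or cached, which is where the speedup comes from. This is a simplified two-level sketch of the idea only, not blkhash's actual algorithm, and its output is not compatible with blkhash.

```go
// Simplified Merkle-style block hash: each fixed-size block is hashed on its
// own and the final digest is a hash over the per-block digests. Illustrative
// only; not compatible with blkhash output.
package example

import (
	"crypto/sha256"
	"encoding/hex"
	"io"
)

const blockSize = 64 * 1024 * 1024 // 64 MiB blocks, chosen arbitrarily here

// blockHash reads r to the end and returns a digest built from per-block digests.
func blockHash(r io.Reader) (string, error) {
	top := sha256.New()
	buf := make([]byte, blockSize)
	for {
		n, err := io.ReadFull(r, buf)
		if n > 0 {
			blockDigest := sha256.Sum256(buf[:n])
			// The top-level hash only covers per-block digests, so blocks
			// could be hashed in parallel and combined afterwards.
			top.Write(blockDigest[:])
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			break
		}
		if err != nil {
			return "", err
		}
	}
	return hex.EncodeToString(top.Sum(nil)), nil
}
```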
Hey @mhenriks, do you think this would make sense in the context of populators?
Created a Jira card to track this issue: https://issues.redhat.com/browse/CNV-31631. Since a single VolumeImportSource can be used for several PVCs, I think this might be more useful for populators.
Hey @dankenigsberg, following some discussions with the team we've concluded that this feature would require a significant divergence from our import flow (https://github.com/kubevirt/containerized-data-importer/issues/1520#issuecomment-1020189814), which seems detrimental for most use cases and overkill for a low priority feature. I'm closing this issue and moving the story to the backlog, but feel free to reopen if necessary. Thanks!