
Add expected_hash to DataVolume

dankenigsberg opened this issue 4 years ago • 19 comments

/kind enhancement

QEMU images may be garbled en route from an HTTP server to CDI's target. To detect this condition, many servers (e.g. Fedora's) provide the expected checksum of the image.

I would like to state the expected hash in the DataVolume spec, and have the import fail if any of the provided hashes does not match the imported data.

apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: "example-import-dv"
spec:
  source:
    http:
      url: "http://mirror.isoc.org.il/pub/fedora/releases/33/Cloud/x86_64/images/Fedora-Cloud-Base-33-1.2.x86_64.raw.xz"
    hashes:
      sha256: 35fa778f5d4830b58f7baf121fff6bd2b52500411c9abc46761b29a690415c3f
      length: 203308980

dankenigsberg avatar Dec 08 '20 15:12 dankenigsberg

This stackoverflow answer may provide an elegant way to integrate hashing into our stream readers.
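
Roughly, the pattern looks like this minimal Go sketch (the URL and destination are placeholders, not CDI code): wrapping the response body in an io.TeeReader feeds every byte read into a hash.Hash, so the digest is computed as a side effect of the normal copy.

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "io"
    "net/http"
)

func main() {
    resp, err := http.Get("http://example.com/image.raw.xz") // placeholder URL
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    hasher := sha256.New()
    // Every byte read from the body is also written to the hasher.
    body := io.TeeReader(resp.Body, hasher)

    // Stand-in for the real destination (disk image, conversion pipeline, ...).
    if _, err := io.Copy(io.Discard, body); err != nil {
        panic(err)
    }
    fmt.Println("sha256:", hex.EncodeToString(hasher.Sum(nil)))
}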

aglitke avatar Dec 14 '20 21:12 aglitke

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

kubevirt-bot avatar Mar 14 '21 21:03 kubevirt-bot

/remove-lifecycle stale

dankenigsberg avatar Mar 15 '21 04:03 dankenigsberg

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

kubevirt-bot avatar Jun 13 '21 04:06 kubevirt-bot

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

kubevirt-bot avatar Sep 11 '21 05:09 kubevirt-bot

/remove-lifecycle stale

@mhenriks would you explain here why this is nontrivial to implement?

dankenigsberg avatar Sep 12 '21 09:09 dankenigsberg

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

kubevirt-bot avatar Dec 11 '21 10:12 kubevirt-bot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

kubevirt-bot avatar Jan 10 '22 10:01 kubevirt-bot

/remove-lifecycle rotten

dankenigsberg avatar Jan 10 '22 10:01 dankenigsberg

/lifecycle frozen

I want this too :)

I think this is essential. For containerDisks this is kind of built-in (at least I hope that skopeo does these checks).

rmohr avatar Jan 13 '22 16:01 rmohr

This stackoverflow answer may provide an elegant way to integrate hashing into our stream readers.

Can confirm that this pattern works great. We use it for instance here: https://github.com/kubevirt/containerdisks/blob/main/pkg/http/http.go#L38

rmohr avatar Jan 13 '22 16:01 rmohr

@rmohr @dankenigsberg

Can confirm that this pattern works great. We use it for instance here: https://github.com/kubevirt/containerdisks/blob/main/pkg/http/http.go#L38

IMO it is not secure to compute the checksum at t0 and assume it is the same at t1, especially with an http (not https) URL.

The only truly secure way to do this is to download the file to scratch space, compute the checksum (which can be done while downloading), then use the downloaded file. This is definitely something we can do, but it will make certain operations slower (http(s) qcow2 or raw).
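
As a sketch of that flow (the function name and argument handling are illustrative, not CDI's importer code): stream to the scratch file and a hasher in a single pass with io.MultiWriter, and hand the file to the next stage only if the digest matches.

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "io"
    "net/http"
    "os"
)

// downloadAndVerify streams url into dest while hashing in the same pass,
// and fails if the digest does not match expectedSHA256. Only when it
// returns nil should dest be handed to the next stage.
func downloadAndVerify(url, dest, expectedSHA256 string) error {
    resp, err := http.Get(url)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    f, err := os.Create(dest)
    if err != nil {
        return err
    }
    defer f.Close()

    hasher := sha256.New()
    // One pass: each chunk goes to both the scratch file and the hasher.
    if _, err := io.Copy(io.MultiWriter(f, hasher), resp.Body); err != nil {
        return err
    }

    if got := hex.EncodeToString(hasher.Sum(nil)); got != expectedSHA256 {
        return fmt.Errorf("checksum mismatch: got %s, want %s", got, expectedSHA256)
    }
    return nil
}

func main() {
    // Usage: prog <url> <scratch-path> <expected-sha256>
    if err := downloadAndVerify(os.Args[1], os.Args[2], os.Args[3]); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}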

mhenriks avatar Jan 24 '22 14:01 mhenriks

IMO it is not secure to compute the checksum at t0 and assume it is the same at t1, especially with an http (not https) URL.

Could you explain what you mean? What is the difference between computing the checksum during the download (streaming) and first downloading and then computing it? I am not suggesting computing the checksum from the remote location first and then downloading the file.

Edit: Oh, you are probably talking about directly pushing and releasing while it streams through. Yes, right. If the target has no toggle which can be triggered after the import to make it usable, then yes, definitely.

rmohr avatar Jan 24 '22 15:01 rmohr

Edit: Oh, you are probably talking about directly pushing and releasing while it streams through. Yes, right. If the target has no toggle which can be triggered after the import to make it usable, then yes, definitely.

Yeah, for example, we can currently stream a qcow2 file directly to raw via http. This requires no scratch space. Validating the qcow2 checksum would require downloading the qcow2 file to scratch space.

mhenriks avatar Jan 24 '22 16:01 mhenriks

Yeah, for example, we can currently stream a qcow2 file directly to raw via http. This requires no scratch space. Validating the qcow2 checksum would require downloading the qcow2 file to scratch space.

I am not sure I understand this example. Are you talking about cases where you internally use a tool which requires an http source and does the download itself? Otherwise it seems to me that you could always calculate the checksum while you convert, with a tee reader, or not?

You could still fail the import after the full download where you finally know that the checksum is bad.

rmohr avatar Jan 24 '22 16:01 rmohr

I am not sure I understand this example. Are you talking about cases where you internally use a tool which requires an http source and does the download itself? Otherwise it seems to me that you could always calculate the checksum while you convert, with a tee reader, or not?

nbdkit is used to give CDI one interface to a bunch of files/formats on the other end of a URL. My assumption is that when a checksum is provided, CDI will first download/validate the file, then point nbdkit to the local file rather than http.

mhenriks avatar Jan 24 '22 16:01 mhenriks

nbdkit is used to give CDI one interface to a bunch of files/formats on the other end of a URL. My assumption is that when a checksum is provided, CDI will first download/validate the file, then point nbdkit to the local file rather than http.

In this case I would probably aim for providing sockets or file descriptors (e.g. pipes) to nbdkit (unless nbdkit supports checksums directly).

rmohr avatar Jan 24 '22 16:01 rmohr

In this case I would probably aim for providing sockets or file descriptors to nbdkit (unless nbdkit supports checksums directly).

If nbdkit does not access the file sequentially, I don't think we will be able to efficiently compute checksums. And I'm pretty sure it does not access qcow sequentially.

mhenriks avatar Jan 24 '22 16:01 mhenriks

If nbdkit does not access the file sequentially, I don't think we will be able to efficiently compute checksums. And I'm pretty sure it does not access qcow sequentially.

If this is needed, definitely :)

rmohr avatar Jan 24 '22 16:01 rmohr

Would a Merkle hash be better than a linear hash? See, for example, https://listman.redhat.com/archives/libguestfs/2023-February/030908.html, which describes how blkhash achieves faster hash results by using a Merkle tree.
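
For illustration only, here is a simplified two-level block-hash scheme in Go (not blkhash's actual construction): hash fixed-size blocks independently, then hash the concatenated block digests. Because the block hashes are independent, they can be computed in parallel or out of order, which a linear hash cannot.

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
)

const blockSize = 64 * 1024 // illustrative only; blkhash defines its own block size

// blockDigest hashes fixed-size blocks independently, then hashes the
// concatenation of the block digests. Unlike a linear hash, which must
// consume the bytes sequentially, each block can be hashed in any order.
func blockDigest(data []byte) string {
    top := sha256.New()
    for off := 0; off < len(data); off += blockSize {
        end := off + blockSize
        if end > len(data) {
            end = len(data)
        }
        h := sha256.Sum256(data[off:end])
        top.Write(h[:])
    }
    return hex.EncodeToString(top.Sum(nil))
}

func main() {
    fmt.Println(blockDigest(make([]byte, 3*blockSize+123)))
}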

ebblake avatar Mar 01 '23 17:03 ebblake

Hey @mhenriks, do you think this would make sense in the context of populators?

alromeros avatar Aug 02 '23 13:08 alromeros

Created a Jira card to track this issue: https://issues.redhat.com/browse/CNV-31631. Since a single VolumeImportSource can be used for several PVCs, I think this might be more useful for populators.

alromeros avatar Aug 02 '23 14:08 alromeros

Hey @dankenigsberg, following some discussions with the team, we've concluded that this feature would require a significant divergence from our import flow (https://github.com/kubevirt/containerized-data-importer/issues/1520#issuecomment-1020189814), which seems detrimental for most use cases and overkill for a low-priority feature. I'm closing this issue and moving the story to the backlog, but feel free to reopen if necessary. Thanks!

alromeros avatar Oct 23 '23 09:10 alromeros