
Verify a file/folder

Open RubenKelevra opened this issue 1 year ago • 14 comments

Checklist

  • [X] My issue is specific & actionable.
  • [X] I am not suggesting a protocol enhancement.
  • [X] I have searched on the issue tracker for my issue.

Description

It would be nice if I could supply a CID (like a regular SHA hash) together with a file/folder and let kubo fetch the necessary metadata and check the local data for errors.

Rationale

I wrote a feature request for using Kubo in Lutris and noticed that this would be quite useful: they may want to fetch downloads via HTTP(S) and only fall back to IPFS if that fails.

This way they wouldn't need both a hash and a CID; they could use just the CID, with the URL imported once via the URL store.

Usage

kubo verify --file=/path/to/file /ipfs/CID
kubo verify --ignore-auxiliary=1 --folder=/path/to/folder/ /ipfs/CID 

(Where --ignore-auxiliary=1 in the folder version would ignore any additional files/folders like .git)

kubo verify /ipfs/CID

This would work analogously to kubo get, searching the ./$CID folder for the file/folder structure.

Edit: Fix language

RubenKelevra avatar Aug 08 '22 12:08 RubenKelevra

@RubenKelevra how would you implement that technically? IPFS hashes are not hashes of the files, but hashes of the DAG (because this allows downloads to be parallelized).

Jorropo avatar Aug 08 '22 12:08 Jorropo

@RubenKelevra how would you implement that technically? IPFS hashes are not hashes of the files, but hashes of the DAG (because this allows downloads to be parallelized).

Yeah, sure. The idea is that you only need metadata fetched from the network to do the verification:

  • The CID itself specifies the hash algorithm.
  • The DAG layout may differ (trickle vs. balanced), but that doesn't matter here.
  • The DAG encodes whether the blocks are raw leaves or not.
  • You don't need to know the chunker, as you only need to reproduce the cut marks stored in the DAG as the length of each block of the file.

So you would basically read the metadata from the network, and instead of fetching the data from the network, do offset reads of the local file.

If the file is equal to what you would read from the network, the file is verified.

RubenKelevra avatar Aug 08 '22 12:08 RubenKelevra

@RubenKelevra your solution is creative, as it indeed does not require more metadata than what is already contained in unixfs and CIDs today. However, it requires fetching some blocks (all the non-leaf blocks) over the network, which might be confusing for some people.

I was expecting something along the lines of encoding the chunker parameters in the CID or the first root block, which would have been a no (it would require standard and repeatable chunkers, which is a huge pain in the ass to do with updates, like the MFS does).

We already have other use cases that would like to bundle all the non-leaf blocks into a CAR file or something like that for fast leaf discovery (it would be like a .torrent file, but for IPFS data), so maybe requiring those blocks for verification isn't too far-fetched.

I'm leaving this open for other people to say what they think.

Jorropo avatar Aug 08 '22 12:08 Jorropo

Yeah, I would prefer a way to encode everything into the CID, too. But I think that fits better as a conceptual requirement for a version 2 of the CID itself.

I think fetching metadata from the network is fine. I don't feel like that's confusing for a user. When I look back, I was more confused that I can run certain IPFS commands without running a daemon – aka offline 😂

As an added benefit, the node which tests the integrity will fetch the whole metadata, so it replicates everything necessary to import the file locally (which would be my next feature request, building on this one).

If I have a file and a CID for it, it would be neat if I could just import the CID and provide the file instead of fetching the same data from the network, just to provide it.

RubenKelevra avatar Aug 08 '22 12:08 RubenKelevra

I think fetching metadata from the network is fine. I don't feel like that's confusing for a user.

We tell people that CIDs are just hashes (which they are). And sha256sum can check hashes of files without doing any network access, so why couldn't Kubo do it too?

(Obviously the answer is that IPFS chunks and DAGifies the files to make them P2P friendly, but not everyone knows that.)

Jorropo avatar Aug 08 '22 13:08 Jorropo

UX idea: if you rename it from verify to something like files compare then this command makes more sense, and the fact it will fetch some blocks is no longer weird.

lidel avatar Aug 11 '22 14:08 lidel

If we add an error like:

Some of the files were not built with the newer and more efficient --raw-leaves encoding; this might require downloading all the binary content to verify.

Use --verify-protobuf-wrapped-leaves to allow the more expensive verification.

people would correctly anticipate what happens with such (bad) files.

Even if there are workarounds (like trying to predict whether a dag-pb node is a leaf, rewrapping it on our side, and hashing the wrapped one), this would be inefficient.

If we do that (and name it compare), I'm fine with it.

Jorropo avatar Aug 11 '22 15:08 Jorropo

@Jorropo

Another option would be to create a "cid-pack" format. So if you have a CID, this cid-pack would contain all the additional information needed to verify the file without any network connectivity.

So if you have network connectivity and the CID, but not the files, you can compile a cid-pack. If you later have the files but no network access, you can still verify them.

The main advantage over regular sha512 sums would be, similar to torrent files, that you know which part is broken, and even with a dumb HTTP server as the source you could do a range request to refetch just that part.

RubenKelevra avatar Aug 14 '22 09:08 RubenKelevra

UX idea: if you rename it from verify to something like files compare then this command makes more sense, and the fact it will fetch some blocks is no longer weird.

Except that ipfs files is already in use. This sounds like you want to compare a file inside the MFS to a CID, which hardly makes sense, as you already have it in the MFS.

I always try to make it readable:

"Let ipfs verify --file=a [with] content-id"
"Let ipfs verify content-id"

This doesn't work with ipfs files compare:

"Let ipfs files compare content-id"

RubenKelevra avatar Aug 14 '22 09:08 RubenKelevra

@RubenKelevra Aren't CAR files what "cid-pack" files would be? https://github.com/ipld/go-car

Winterhuman avatar Aug 14 '22 20:08 Winterhuman

@RubenKelevra Aren't CAR files what "cid-pack" files would be? https://github.com/ipld/go-car

Well, CAR files must contain the data to be valid, mustn't they?

RubenKelevra avatar Aug 14 '22 21:08 RubenKelevra

Oh, I see what you mean now, nevermind then

Winterhuman avatar Aug 14 '22 22:08 Winterhuman

Another option would be to create a "cid-pack" format. So if you got a cid, this cid-pack would contain any additional information to verify the file without any network connectivity.

So if you have a network connectivity, and not the files, but the CID, you can compile a cid-pack. If you then have the files but no network access, you can verify the files.

The main advantage over regular sha512 sums would be, similar to torrent files, that you know which part is broken and even with a dumb http server as source you could do a range request to refetch the info.

That's what I meant with:

We already have other use cases that would like to bundle all the non-leaf blocks into a CAR file or something like that for fast leaf discovery (it would be like a .torrent file, but for IPFS data), so maybe requiring those blocks for verification isn't too far-fetched.


Well, CAR files must contain the data to be valid, mustn't they?

No, CAR files are just lists of blocks with a header and optional features; nothing says they need to be complete. You could include just the root blocks but not the leaves.

Jorropo avatar Aug 14 '22 22:08 Jorropo

2022-09-02 triage notes:

  1. We labeled this issue. It will require someone from the community to pick it up, at least for now. Maintainers won't be able to take it on.
  2. @Jorropo had the idea that this could be useful for verifying UnixFS implementations.

BigLep avatar Sep 02 '22 16:09 BigLep