# Verify a file/folder

### Checklist
- [X] My issue is specific & actionable.
- [X] I am not suggesting a protocol enhancement.
- [X] I have searched on the issue tracker for my issue.
### Description

It would be nice if I could supply a CID like a regular SHA hash, plus a file/folder, and let Kubo fetch the necessary metadata and check the local data for errors.
### Rationale

I wrote a feature request for using Kubo in Lutris and noticed that this would be quite neat, as they may want to fetch downloads via HTTP(S) and only fall back to IPFS if that fails.
This way they wouldn't need both a hash and a CID, but could just use the CID and keep the URL once it is imported via the URL store.
### Usage

```
kubo verify --file=/path/to/file /ipfs/CID
kubo verify --ignore-auxiliary=1 --folder=/path/to/folder/ /ipfs/CID
```

(where `--ignore-auxiliary=1` in the folder version would ignore any additional files/folders like `.git`)

```
kubo verify /ipfs/CID
```

would work analogously to `kubo get`, searching the `./$CID` folder for the file/folder structure.
Edit: Fix language
@RubenKelevra how would you implement that technically? IPFS hashes are not hashes of the files, but hashes of the DAG (because this allows parallelizing downloads).
> @RubenKelevra how would you implement that technically? IPFS hashes are not hashes of the files, but hashes of the DAG (because this allows parallelizing downloads).
Yeah, sure. The idea is that you only need metadata fetched from the network to do the verification:
- The CID itself specifies the hash algorithm.
- The DAG layout may differ (trickle vs. balanced), but that doesn't matter here.
- The DAG records whether the blocks are raw leaves or not.
- You don't need to know the chunker, as you only need to replay the cut marks stored in the DAG, i.e. the length of each block of the file.
So you would basically read the metadata from the network and, instead of fetching the data from the network, do an offset read of the local file.
If what you read locally equals what you would have read from the network, the file is verified.
@RubenKelevra your solution is creative, as it indeed does not require more metadata than what is already contained in UnixFS and CIDs today. However, it requires fetching some blocks (all the non-leaf blocks) over the network, which might be confusing for some people.
I was expecting something along the lines of encoding the chunker parameters in the CID or in the first root block, which would have been a no (it would require standardized and repeatable chunkers, which is a huge pain in the ass to do with updates (like the MFS does)).
We already have other efforts that would like to bundle all the non-leaf blocks into a CAR file or something like that for fast leaf discovery (it would be like a `.torrent` file, but for IPFS data), so maybe requiring those blocks for verification isn't too far-fetched.
I'm leaving this open for other people to say what they think.
Yeah, I would prefer a way to encode everything into the CID, too. But I think that fits better as a conceptual requirement for a version 2 of the CID itself.
I think fetching metadata from the network is fine. I don't feel like that's confusing for a user. When I look back, I was more confused that I can run certain IPFS commands without running a daemon – aka offline 😂
As an added benefit, the node that tests the integrity will fetch the whole metadata, so it replicates everything necessary to import the file locally (which would be my next feature request, building on this one).
If I have a file and a CID for it, it would be neat if I could just import the CID and provide the file instead of fetching the same data from the network, just to provide it.
> I think fetching metadata from the network is fine. I don't feel like that's confusing for a user.
We say to people that CIDs are just hashes (which they are). And `sha256sum` can check hashes of files without doing any network access, so why couldn't Kubo do it too?
(Obviously the answer is that IPFS chunks and DAGifies the files to make them P2P friendly, but not everyone knows that.)
UX idea: if you rename it from `verify` to something like `files compare`, then this command makes more sense, and the fact that it will fetch some blocks is no longer weird.
If we add an error like:

```
Some of the files were not built with the newer and more efficient --raw-leaves encoding; this might require downloading all the binary content to verify.
Use --verify-protobuf-wrapped-leaves to allow this more expensive verification.
```

people would correctly expect what happens with (bad) files.
Even if there are workarounds (like trying to predict whether a dag-pb block is a leaf, rewrapping it on our side, and hashing the wrapped one), this would be inefficient.
If we do that (and name it `compare`), I'm fine with it.
@Jorropo
Another option would be to create a "cid-pack" format. So if you have a CID, this cid-pack would contain all additional information needed to verify the file without any network connectivity.
So if you have network connectivity and the CID, but not the files, you can compile a cid-pack. If you then have the files but no network access, you can verify the files.
The main advantage over regular sha512 sums would be that, similar to torrent files, you know which part is broken, and even with a dumb HTTP server as the source you could do a range request to refetch the broken part.
> UX idea: if you rename it from `verify` to something like `files compare`, then this command makes more sense, and the fact that it will fetch some blocks is no longer weird.
Except that `ipfs files` is already in use. This sounds like you want to compare a file inside the MFS to a CID, which hardly makes sense, as you already have it in the MFS.
I always try to make it readable:

"Let ipfs verify `--file=a` [with] `content-id`"

"Let ipfs verify `content-id`"

This doesn't work with `ipfs files compare`:

"Let ipfs files compare `content-id`"
@RubenKelevra Aren't CAR files what "cid-pack" files would be? https://github.com/ipld/go-car
> @RubenKelevra Aren't CAR files what "cid-pack" files would be? https://github.com/ipld/go-car
Well, CAR files must contain the data to be valid, mustn't they?
Oh, I see what you mean now, nevermind then
> Another option would be to create a "cid-pack" format. So if you have a CID, this cid-pack would contain all additional information needed to verify the file without any network connectivity.
>
> So if you have network connectivity and the CID, but not the files, you can compile a cid-pack. If you then have the files but no network access, you can verify the files.
>
> The main advantage over regular sha512 sums would be that, similar to torrent files, you know which part is broken, and even with a dumb HTTP server as the source you could do a range request to refetch the broken part.
I meant that with:
> We already have other efforts that would like to bundle all the non-leaf blocks into a CAR file or something like that for fast leaf discovery (it would be like a `.torrent` file, but for IPFS data), so maybe requiring those blocks for verification isn't too far-fetched.
> Well, CAR files must contain the data to be valid, mustn't they?
No, CAR files are just lists of blocks with a header and optional features; nothing says they need to be complete. You could just include the root blocks but not the leaves.
2022-09-02 triage notes:
- We labeled this issue. It will require someone from the community to pick up, at least currently. Maintainers won't be able to take it on.
- @Jorropo had the idea that this could be useful for verifying UnixFS implementations.