Deny crates with unavailable source code or unclear correspondence between crate registry release and version control history
**Is your feature request related to a problem? Please describe.**
Occasionally we will try to debug an issue in a crate and go looking for its source code, only to find that one of the following is true:

1. The repository URL in `Cargo.toml` leads to a 404.
2. The repository URL in `Cargo.toml` points to a different fork of the crate, such that it does not correspond to the release we actually got from crates.io.
3. The repository URL in `Cargo.toml` is correct, but the release has been pushed to crates.io without the corresponding commits being pushed to the repository. (Surprisingly common with crates mostly maintained by one person who may forget to push.)
4. The repository URL in `Cargo.toml` is correct, and the repository probably contains the commits, but it is difficult to determine which commit corresponds to the release, because of an absence of tags / commit messages / changelog entries that would make it obvious.
**Describe the solution you'd like**
Ideally we'd like to deny all of the above situations, since they are all difficult to distinguish from a malicious release.
(1) and (2) above seem quite achievable for cargo-deny to detect.
(3) and (4) could be more difficult. One approach would be to require tags in the repository with a recognisable name format (`0.1.2` or `v0.1.2` seem most common) that correspond to releases in the crate registry. This would generate a lot of warnings at first, because many common crates do not consistently tag releases, but perhaps with some encouragement things could be improved there.
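The tag-name matching described above could be sketched roughly as follows. This is a minimal illustration with a hypothetical helper name; the crate-prefixed form (`crate-0.1.2`) is an extra assumption beyond the two formats mentioned, included because some multi-crate repositories tag that way:

```python
import re

def tag_matches_version(tag: str, crate: str, version: str) -> bool:
    """Return True if `tag` looks like a release tag for `version`.

    Covers the two common formats mentioned above (0.1.2 and v0.1.2);
    the crate-prefixed variant (e.g. addr2line-0.1.2) is an assumption.
    """
    pattern = rf"(?:{re.escape(crate)}[-/])?v?{re.escape(version)}"
    return re.fullmatch(pattern, tag) is not None
```

Anchoring with `fullmatch` matters here, so that `0.1.2` does not accidentally match a `0.1.20` tag.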
Agreed that it would be quite powerful to be able to verify that a crate matches the git repo it was published from, and to warn on or require that (with exceptions). But it would probably be quite hard to implement in practice, as there are no standards or other requirements around it.
Though interesting to investigate, a key challenge would be making it fast: it is pretty common for projects to depend on 300+ crates, and syncing all of them from their git repos would be quite slow and potentially use a lot of bandwidth (especially those that contain submodules and the like).
One thing that I think would be interesting to explore is using the `.cargo_vcs_info.json` file that Cargo creates and includes in each published crate, so it is already available for each synced crate. For example, for `addr2line-0.14.1` it looks like this:
```json
{
  "git": {
    "sha1": "f7053dd93cb9dc2feb59f459ab7c483a4ed15e22"
  }
}
```
And as the `addr2line` crate specifies `repository = "https://github.com/gimli-rs/addr2line"` in its `Cargo.toml`, one can directly look up that commit on GitHub and see that it is a valid one: https://github.com/gimli-rs/addr2line/commit/f7053dd93cb9dc2feb59f459ab7c483a4ed15e22
If it were an invalid commit, one would get back a 404 from GitHub. This URL scheme is of course git-provider specific, so it wouldn't work for every repo, but it could be a mechanism to avoid syncing the repo from git, and multiple providers (GitLab etc.) could be supported.
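Assuming the repository URL comes from the crate's `Cargo.toml` and the SHA from the embedded `.cargo_vcs_info.json`, building that provider-specific lookup URL could be sketched like this (hypothetical helper; only GitHub's scheme handled):

```python
import json

def commit_url(vcs_info_json: str, repository: str):
    """Build a commit-lookup URL from the contents of a crate's
    .cargo_vcs_info.json and its Cargo.toml `repository` field.

    Hypothetical sketch: only GitHub's URL scheme is handled here;
    other providers (GitLab etc.) would need their own patterns.
    """
    sha = json.loads(vcs_info_json).get("git", {}).get("sha1")
    if sha is None:
        # No recorded commit: nothing to verify this way.
        return None
    if repository.startswith("https://github.com/"):
        return f"{repository.rstrip('/')}/commit/{sha}"
    return None
```

For the `addr2line-0.14.1` example above, this yields exactly the URL quoted in the previous comment.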
Maybe there are standard git protocols for remotes that could be used to see which commits / tags are available without syncing the entire repo first?
Yes, this can be done with standard Git, but it does require server support (`uploadpack.allowAnySHA1InWant`), which is not always enabled. GitHub seems to allow it.
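For tags specifically, `git ls-remote` can enumerate a remote's refs without cloning anything, which would cover the tag-based check without server-side configuration. A sketch (hypothetical helper; assumes the `git` binary is on `PATH`):

```python
import subprocess

def remote_tags(repo_url: str) -> dict:
    """Map tag name -> commit SHA for a remote, without cloning.

    `git ls-remote` only advertises refs (tags/branches), so this can
    find tagged releases but cannot prove an arbitrary commit exists.
    """
    out = subprocess.run(
        ["git", "ls-remote", "--tags", repo_url],
        capture_output=True, text=True, check=True,
    ).stdout
    tags = {}
    for line in out.splitlines():
        sha, _, ref = line.partition("\t")
        name = ref.removeprefix("refs/tags/")
        if name.endswith("^{}"):
            # Peeled entry: the commit an annotated tag points at;
            # prefer it over the tag object's own SHA.
            tags[name[:-3]] = sha
        else:
            tags.setdefault(name, sha)
    return tags
```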
I can demonstrate it using the redis crate, which recently had a 0.20.1 release that doesn't correspond to any visible commits on GitHub. Thanks to your .cargo_vcs_info.json trick (TIL!) I was able to find out the corresponding commit, which it turns out is present in the repository but not associated with any branch.
I can fetch that single commit from GitHub as follows:
```
$ git init tmprepo
Initialized empty Git repository in tmprepo/.git/
$ cd tmprepo
$ du -hs .
76K	.
$ git fetch --depth=1 https://github.com/mitsuhiko/redis-rs.git 9b6f35dad8e865e5abee3c429be770b4b9e08517
remote: Enumerating objects: 80, done.
remote: Counting objects: 100% (80/80), done.
remote: Compressing objects: 100% (68/68), done.
remote: Total 80 (delta 3), reused 40 (delta 0), pack-reused 0
Unpacking objects: 100% (80/80), 133.38 KiB | 794.00 KiB/s, done.
From https://github.com/mitsuhiko/redis-rs
 * branch            9b6f35dad8e865e5abee3c429be770b4b9e08517 -> FETCH_HEAD
$ echo $?
0
$ du -hs .
468K	.
```
If I change a single digit of the commit hash, the same command fails and the exit code can be used to see that the commit did not exist:
```
$ git fetch --depth=1 https://github.com/mitsuhiko/redis-rs.git 9b6f35dad8e865e5abee3c429be770b4b9e08516
fatal: remote error: upload-pack: not our ref 9b6f35dad8e865e5abee3c429be770b4b9e08516
$ echo $?
128
```
In theory if one was speaking the Git protocol without using the git tool, one could probably abandon the fetch as soon as it's clear the object exists, to avoid needing to actually download the data.
Ah neat, it would indeed be good if it were possible to skip the actual fetch and just verify that the commit exists.
If the crate wasn't dirty when published, it should contain the exact commit hash. Accordingly, it'd just need to ban dirty crates and confirm commit availability for non-dirty ones.
I've been interested in this exact feature ever since a dependency of a major cryptography library didn't publish its source for roughly a year.