composer icon indicating copy to clipboard operation
composer copied to clipboard

SHA1 sum of package dists should be more reliable than a SHA1 on the zip result

Open Seldaek opened this issue 10 years ago • 47 comments

(Replacing #1496 which has become a mess, references #5940)

If multiple servers create archives, then those archives can have different SHA1s which is problematic. Potential solutions:

  • hash the contents in a reproducible way (might be very slow)
  • just avoid recreating archives all the time (depends on ecosystem but preferable)

Seldaek avatar Dec 31 '13 16:12 Seldaek

Is there a solution until now?

fadoe avatar May 13 '14 09:05 fadoe

I'm not an expert in the ZIP format, but I guess the file access times (https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT) are changing and thereby messing up the checksum. Would it be an option the mask out the relevant parts?

david0 avatar Feb 13 '15 09:02 david0

@david0 it's a good theory/idea, either that or even modification times. I am not sure if these metadata bits are part of the header or if they are also compressed together with the file contents. If it's in the headers then maybe we could more or less easily mask them and then hash the outcome yes.

If anyone is up for digging further in the spec and possibly trying to build a small proof of concept (it really does not need to be part of composer as a first step) that'd be great.

Seldaek avatar Feb 14 '15 15:02 Seldaek

My proof of concept.

I also noticed that file access times etc. are stored in an optional field "extra". It is possible to omit this field when creating zip files using zip -X. So if we would find a way to create the zip-files without "extra" for satis, the problem should be solved.

david0 avatar Feb 14 '15 18:02 david0

@david0 ideally we should support this for zips created by github as well though where we can't control the process, so if we can strip the headers it'd be best.

I am not sure about the proof of concept because I get the same shasums anyway for the testdata1/2 and 3/4 couples. So the sha.php code does not prove much but it does indeed produce different shas that are also consistent.

It'd be nice if we could reproduce the exact issue (two different zips of same content) and get the same sha.

Seldaek avatar Feb 16 '15 14:02 Seldaek

@Seldaek must be some difference on your system (with atime). I updated the poc and included my testfiles.

$ shasum *.zip
689595f5f847e9edc4c01fb2c81a17472f009df5  testdata1_extra.zip
0911f01200a2a9a0b7f251b7308fcbe3a18c8f56  testdata2_extra.zip
312d94af3ac4c929169e5e515b53ee69be4b1e1c  testdata3_noextra.zip
312d94af3ac4c929169e5e515b53ee69be4b1e1c  testdata4_noextra.zip
34ef754c68ffbb559954f7932b70e46b5d826cc3  testdata5_corrupted_extra.zip
823ced3f35d98acbb4cb33dd1b1bb25e2f611323  testdata5_corrupted_noextra.zip
$ php sha.php
testdata1_extra.zip: e122eb938e4a680503b97c278f71583d27537463
testdata2_extra.zip: e122eb938e4a680503b97c278f71583d27537463
testdata3_noextra.zip: 31b74a5869270919e41f185f593f14306d9dc8c9
testdata4_noextra.zip: 31b74a5869270919e41f185f593f14306d9dc8c9
testdata5_corrupted_extra.zip: 63eae002cb6deb46240d300308ea783f9fd41735
testdata5_corrupted_noextra.zip: 3b0d719237c790a00cf28fc998ebf6b7dc2a8f71

It would be also possible to make testdata 1-4 have the same hash by ignoring the central directory structure also.

david0 avatar Feb 17 '15 20:02 david0

OK I get the same hashes as you do so that's a good start ;) Not sure what central directory structure is, does that include paths/directory structure of the unzipped output? Because ignoring that might be risky.

Seldaek avatar Feb 17 '15 20:02 Seldaek

The central directory structure seems the be a repetition of the file headers for random access. I updated the poc once again and now I get same hashes for testdata 1-4.

david0 avatar Feb 20 '15 15:02 david0

@Seldaek do you have some feedback about @david0's work ? It would be really interesting to fix these checksum "issue" IMO because it is causing some problems with private dependency management basically.

sroze avatar Aug 03 '15 13:08 sroze

@sroze just haven't had time to look into this in more details, but yes in theory it's a good way to do hashing for zip files, it's probably not very hard to add that as an optional hashing mechanism to verify the shasum of zips.

Seldaek avatar Aug 10 '15 00:08 Seldaek

What about the ability to disable the checksum until a real solution can be found?

rawkode avatar Aug 10 '15 15:08 rawkode

AFAIK packagist doesn't have checksums and toran proxy doesn't either, I am not sure about satis but overall checksums aren't really used atm due to the issues they caused.

Seldaek avatar Aug 10 '15 15:08 Seldaek

Well, Packagist does not use the checksum because the Github API does not provide a checksum for its archives (probably because it changes over time as they generate archives on demand for commits). However, Satis sets the checksum as it has the archive already and so can compute it

stof avatar Aug 10 '15 15:08 stof

We could also download the archives on packagist to get a checksum, that's not really the issue.

Seldaek avatar Aug 10 '15 15:08 Seldaek

So if I submit a PR with a config inside satis.json to allow disabling satis checksums inside packages.json - there'd be no initial objections?

rawkode avatar Aug 10 '15 16:08 rawkode

@Seldaek that would be both inefficient (downloading all archives in the GithubDriver just to get the checksum) and a bad idea (as Github archives are generated on the fly and so can have different metadata over time)

stof avatar Aug 10 '15 16:08 stof

I would agree to have an option (just a command-line one could be sufficient) to disable checksum checks. At the same time, having a real solution like this one would be far better.

@Seldaek if you're in with this idea, I can PR that --disable-checksums option.

sroze avatar Aug 11 '15 08:08 sroze

Any updates here? I am regularly seeing issues like this which slow down our deployments due to Composer falling back to Git cloning instead of downloading dist archives:

- Installing typo3/cms (6.2.17)
    Downloading
Failed to download typo3/cms from dist: The checksum verification of the file failed (downloaded from https://api.github.com/repos/TYPO3/TYPO3.CMS/zipball/449cf97a914c2dd719b728dc77d7b547ba557f33)
Now trying to download from source
- Installing typo3/cms (6.2.17)
Cloning 449cf97a914c2dd719b728dc77d7b547ba557f33

mbrodala avatar Feb 04 '16 09:02 mbrodala

how are you ending up with a checksum for github downloads ? Packagist and Composer don't add them as github does not provide checksums (and it cannot as they generate the archive on demand meaning that the checksum is not stable over time)

stof avatar Feb 04 '16 10:02 stof

@stof Good question, I cannot say what causes these checksums to show up. Could it be that Composer caches can cause this issue? I am unable to reproduce the issue when pulling typo3/cms locally but I can see it reliably in our build logs (Shippable).

mbrodala avatar Feb 04 '16 11:02 mbrodala

Any update to this? We're using satis and keep seeing this same issue...

oligriffiths avatar Apr 01 '16 14:04 oligriffiths

We also had the issue that we tried to keep the lockfile clean of any shasum values but they would always magically come back upon composer update ... even though they were not present on packagist.

Following @mbrodala s hint I deleted $HOME/.cache/composer but the checksums did still come back.

It turned out that those checksums were also present in vendor/installed.json and they stop reappearing out of nowhere once I also removed ./vendor and ran a clean composer install

fgrosse avatar Apr 19 '16 12:04 fgrosse

Whilst that may work, that doesn't seem like a solution to me. The real question is why are they different...

On 19 Apr 2016, at 08:06, Friedrich Große [email protected] wrote:

We also had the issue that we tried to keep the lockfile clean of any shasum values but they would always magically come back upon composer update ... even though they were not present on packagist.

Following @mbrodala s hint I deleted $HOME/.cache/composer but the checksums did still come back.

It turned out that those checksums were also present in vendor/installed.json and they stop reappearing out of nowhere once I also removed ./vendor and ran a clean composer install

— You are receiving this because you commented. Reply to this email directly or view it on GitHub

oligriffiths avatar Apr 19 '16 12:04 oligriffiths

@oligriffiths in my usecase I did not have an issue with different checksums. However, because two zip checksums can potentially be different for the same content due to timestamps we are trying hard to keep them out of our git repositires.

Just wanted to share my findings for anybody with the same problem :)

fgrosse avatar Apr 19 '16 12:04 fgrosse

So, what about we don't use zip files and use Github's ability to create a tarball instead? Would that solve the problem? I saw in the API that GitHub now provides a tarball_url along the zipball_url: https://developer.github.com/v3/repos/releases/

This would involve changing GitHubDriver to either respect a user's preferred format wishes or download tarballs by default. Any opinions on that?

cvigano avatar May 03 '16 10:05 cvigano

@cvigano wouldn't tarball have exactly the same issue regarding checksums ?

stof avatar May 03 '16 10:05 stof

@stof you are right of course, as gzip automatically writes the filename and modification time into the archive. I forgot to look that up before commenting here. So yeah, disregard my comment and sorry for the noise?

cvigano avatar May 03 '16 10:05 cvigano

I've also noticed that when re-creating the tar/zip, satis loses the executable flag on files, e.g. https://github.com/brianium/paratest/tree/master/bin these get reset, however if you download the zip or clone the repo from github, those flag are set.

oligriffiths avatar May 03 '16 14:05 oligriffiths

@oligriffiths Seems to be a known issue in Satis but not really related to the issue here. ;-)

mbrodala avatar May 03 '16 14:05 mbrodala

Cool, just saying the direct download from github would be a solution

oligriffiths avatar May 03 '16 14:05 oligriffiths

I'm having this issue with .tar archives specifically. Please update the issue title.

boneskull avatar Oct 11 '16 00:10 boneskull

Is there an existing issue to move the hashing also to sha-2 instead of md5 or sha1? I'm a bit surprised that composer as a newish tool is using an outdated and deprecated hash for verification

pwolanin avatar Nov 30 '16 14:11 pwolanin

Regarding striping metadata from ZIP files, look at the implementation in: https://anonscm.debian.org/git/reproducible/strip-nondeterminism.git/tree/lib/File/StripNondeterminism/handlers/zip.pm from strip-nondeterminism: https://packages.debian.org/sid/strip-nondeterminism it also contain handlers (code to strip metadata) from other archives/file-formats: https://anonscm.debian.org/git/reproducible/strip-nondeterminism.git/tree/lib/File/StripNondeterminism/handlers/

emanuelb avatar Dec 05 '16 01:12 emanuelb

So I have wasted hours on finding out why I am getting these warning:

The checksum verification of the file failed

Are there any plans to fix this?

Ilyes512 avatar May 30 '18 14:05 Ilyes512

Don't know if anyone finds this interesting but I usually use "faketime" to create consistent archives:

http://manpages.ubuntu.com/manpages/trusty/man1/faketime.1.html

rgpublic avatar Jul 24 '18 16:07 rgpublic

Could somebody please add a little more context regarding when exactly this problem occurs? What are these "multiple servers" from the OP, and how do they create the archives?

For the record, I am able to create a ZIP file from the https://github.com/symfony/symfony repo that has the same checksum as the ZIP archive downloadable on the GitHub releases page. Try:

TZ=America/Los_Angeles git archive --format zip --prefix=symfony-5.0.2/ v5.0.2 | sha384sum

Edit:

To be a bit more elaborate myself, here's an example for doctrine/cache 1.10.0 as a random choice.

In a composer.lock file, I've got this...

{
            "name": "doctrine/cache",
            "version": "1.10.0",
            "source": {
                "type": "git",
                "url": "https://github.com/doctrine/cache.git",
                "reference": "382e7f4db9a12dc6c19431743a2b096041bcdd62"
            },
            "dist": {
                "type": "zip",
                "url": "https://api.github.com/repos/doctrine/cache/zipball/382e7f4db9a12dc6c19431743a2b096041bcdd62",
                "reference": "382e7f4db9a12dc6c19431743a2b096041bcdd62",
                "shasum": ""
            },
...

So:

$ wget -q https://api.github.com/repos/doctrine/cache/zipball/382e7f4db9a12dc6c19431743a2b096041bcdd62 -O gh.zip
$ sha384sum gh.zip
2251d87b5a288ba10edbbf9c22540cc63ba2e8636a9b6460407b5a55252b5cdc6f89fe9f683cc922d1237035d25d3fcf  gh.zip
$ git clone https://github.com/doctrine/cache.git
$ cd cache
$ TZ=America/Los_Angeles git archive --format zip --prefix=doctrine-cache-382e7f4/ 382e7f4 | sha384sum
2251d87b5a288ba10edbbf9c22540cc63ba2e8636a9b6460407b5a55252b5cdc6f89fe9f683cc922d1237035d25d3fcf  -

Yes, the America/Los_Angeles timezone is a GitHub peculiarity. But I guess Packagist has to clone the repo anyway, so it might also generate and include checksums for the dist archives?

mpdude avatar Jan 03 '20 09:01 mpdude

But I guess Packagist has to clone the repo anyway

No it does not. For github repos, the GithubDriver extracts all data from the API.

stof avatar Feb 20 '20 17:02 stof

Hey @peff 👋🏼,

you've been very helpful back in 2017 over at Homebrew/homebrew-core#18044, where the Homebrew team had to deal with checksums for GitHub repo archives that suddenly changed. To my understanding, this was due to an internal fix in Git and how it exports archives.

Counting on your expertise and hoping that you're still in some way affiliated with GitHub, and also referring to your comment regarding byte-stable tarballs, can you tell us...

  • If it reasonable to assume that checksums of zipballs (like https://api.github.com/repos/doctrine/cache/zipball/382e7f4db9a12dc6c19431743a2b096041bcdd62) are stable over time?
  • What might be possible reasons for a change?

Thanks!

Update for clarification: Obviously I gave one reason myself why the checksum might change, namely changes to the way Git exports and packages archives. But since you mentioned that GitHub might take measures to guarantee byte-stable archives, and given the fact that also Homebrew seems to rely on checksums of GitHub archives (arbitrary example), we might consider the possible fall-out of a checksum change to be big enough that its not going to happen by accident 😉 . Doest that make sense?

mpdude avatar May 27 '21 11:05 mpdude

Counting on your expertise and hoping that you're still in some way affiliated with GitHub, and also referring to your comment regarding byte-stable tarballs, can you tell us...

  • If it reasonable to assume that checksums of zipballs (like https://api.github.com/repos/doctrine/cache/zipball/382e7f4db9a12dc6c19431743a2b096041bcdd62) are stable over time?

  • What might be possible reasons for a change?

Not much (nothing?) has changed since that comment. Zipfiles are treated the same as tarballs: generated by git-archive on the fly. We did talk about cementing byte-for-byte output at the time of tagging, but it's complicated at scale. And it wouldn't help with the example you gave anyway, which is requesting a non-tag. The only reason for a change would be Git's output changing, which itself would likely be caused by a bugfix.

given the fact that also Homebrew seems to rely on checksums of GitHub archives (arbitrary example), we might consider the possible fall-out of a checksum change to be big enough that its not going to happen by accident

Yeah. I think the situation remains "changes should be very rare and avoided if possible, but technically there are no promises".

peff avatar May 27 '21 16:05 peff

Thanks for the clarification. This to me seems like a no-go as it means there is a chance a git or github change would render virtually all composer.lock file in existence invalid and uninstallable due to security warnings. Even if it's very rare it seems like a pretty bad disruption to the whole ecosystem I'd rather avoid.

Seldaek avatar May 27 '21 19:05 Seldaek

Jordi, thank you for getting back to this (c)old case so quickly.

The reason why I was browsing this and related issues (#5940, #4022) is that I was hoping that there would be an (easy?) way of making sure that the code that we see and install is the same on our local development machines, on CI and in production.

I cannot really say if this is already addressed by the composer.lock file, or if another level of verification on the dist file level would be helpful. It seemed to me that if Packagist can derive and provide a checksum for the dist archive at the time a package is added to the registry, and that checksum is verified and written into my composer.lock file during composer update on a dev machine, it would give me some more peace of mind knowing that the code that will be downloaded and used by automatic processes down the road (CI, deployments etc.) will be exactly the same.

I have no experience when it comes to designing security for systems like this, so I cannot even tell what kind of attacks this would prevent or at least mitigate... Maybe a misbehaving package owner trying a targeted attack, serving different dist versions from their own URL? Mirrors providing modified versions of dist archives?

On the other hand: As long as we're downloading GitHub-provided ZIP archives over HTTPS directly from GitHub, and since we can trust GitHub as a platform, and since they create the ZIPs from the corresponding commit ref, just having the reference in the .lock file should be enough.

So, Jordi, do you still recall what exactly your motivation was for opening this issue? Were it security/integrity concerns?

Regarding possible checksum changes: Yes, I see the risk and I understand that users would probably be yelling at Composer/Packagist. OTOH, since at least Homebrew would be affected as well, there would probably be a lot of visibility and support by all parties involved to sort such issues out. And anyway, in my mind, all this would be opt-in (or out?) by a switch like composer install --verify-checksums.

But again, before we're discussing such technical details I'd need to understand what benefits a simple checksum-based approach would bring security-wise.

mpdude avatar May 27 '21 20:05 mpdude

This isn't opt-in or opt-out, all existing Composer versions if they have a checksum defined will verify it and if it fails they throw. So that's why we disabled it for github early on as it wasn't stable enough.

On the other hand: As long as we're downloading GitHub-provided ZIP archives over HTTPS directly from GitHub, and since we can trust GitHub as a platform, and since they create the ZIPs from the corresponding commit ref, just having the reference in the .lock file should be enough.

That is correct I'd say security wise it doesn't add a ton for GitHub, but for other less reputable sources it wouldn't hurt to have. Then again last I checked ~97% of packages were on GitHub, so all in all I'm not sure what is worth spending resources on.

Seldaek avatar May 27 '21 21:05 Seldaek

Gave this some more thought yesterday, and I'm thinking maybe what we can do is this:

  • Add a new checksums key which would be a dictionary instead of the current sha1sum-or-bust, which isn't very future proof.
  • checksums.sha256 or similar would provide a simple hash of the zip, hoping that remains stable over time
  • checksums.contentSha256 would be a more bulletproof hash of the contents of each file sorted in a stable manner. Something we can for sure reproduce, but that'd be slower to generate. That could be used as fallback in case the other hash should fail to validate. So we'd have a slowdown but not a catastrophic community-wide install failure should github change their output.
  • In addition I would add config options:
    • verify-checksums: { foo/bar: true, acme/*: false, *: true } (defaults to {"*": true}) to allow opting out of verification.
    • require-checksums: { foo/bar: true, acme/*: false, *: true } (defaults to {"*": false}) to allow opting in to requiring verification.

That should allow us to roll this in smoothly and in a future-proof manner. That said, before proceeding any further we should first assess what can be done to share/integrate this together with the TUF work going on in https://github.com/composer/composer/issues/9740

These are just notes, I don't want to promise real progress on this any time soon as we have still tons of other work already planned.

Seldaek avatar May 28 '21 07:05 Seldaek

checksums.contentSha256 would be a more bulletproof hash of the contents of each file sorted in a stable manner.

This should mostly work, but I'll mention one case where it might not: if the dependent package uses Git's export-subst mechanism, then the resulting file contents might not be stable. That mechanism replaces a string like $Format:%H$ in the file using %-placeholders like git-log. Most of those are immutable with respect to the commit, but a few aren't (e.g., %d is the list of refs that point to the commit). The only case I've seen of this in the wild is somebody using %h (the abbreviated sha1). That may change over time if another object which collides with the abbreviation, and Git has to extend it to make it unique.

This should be quite rare, I'd think (again, I've seen only one report total of it messing up Homebrew). And it can perhaps be solved by politely explaining to the project you're hashing why %h isn't a great idea. So I'd probably disregard it, but I thought I'd mention it in the name of completeness. :)

peff avatar May 28 '21 13:05 peff

There is always a way things can go wrong :) Thanks for the pointer, but I would indeed expect this to be rare enough that the combination of using export-subst + github changing its output would only impact a small number of people.

Seldaek avatar May 28 '21 14:05 Seldaek