elixir icon indicating copy to clipboard operation
elixir copied to clipboard

Track digests of @external_resources

Open josevalim opened this issue 1 year ago • 10 comments

EDIT: This issue has been updated to mention digest instead of MD5.

Elixir and Erlang/OTP versions

All.

Operating system

All.

Current behavior

Currently we don't track the digests of external resources, this means we may recompile files when nothing really changed, such as in #12056.

Expected behavior

Compute the digest for external resources.

To make sure computing the digest does not slow the compiler process, I have split the compiler process from the Mix process in 9f93d0582a555299aa1a0b982017dd0ae3036181.

josevalim avatar Aug 07 '22 13:08 josevalim

Not to add to the list of problems, but would it be better to use sha1/sha256? Currently, there are tools that one can use to add content to a file and also generate the same md5 checksum: https://github.com/DavidBuchanan314/monomorph

Just trying to give some context before this is implemented. I believe that changing the algorithm and benching everything will be easy enough though.

victorolinasc avatar Oct 03 '22 20:10 victorolinasc

This is not used for security. We need any hashing algorithm and fastest is better. Chance of conflicts for MD5 on actual source code is virtually Zero.

josevalim avatar Oct 03 '22 20:10 josevalim

Hey @josevalim!

I understand that this feature isn't intended to be used for security, but I just wanted to point out one possible risk, in case it falls within a threat model worth considering.

Since these hashes are used to determine when an external resource should be recompiled, a hash collision means being able to avoid recompilation, even if the resource's content changes. As you say, the chances of a conflict are virtually zero, but MD5 is vulnerable to a few different collision attacks that an attacker can leverage as part of a targeted attack, to construct a collision.

This might show up as part of a supply chain attack, where an attacker introduces a backdoor into some file—possibly one defining a security policy—that has been constructed in such a way as to allow for a second, non-backdoored file, with the same hash. Now, if the backdoor is detected, the attacker can pretend to fix it by switching to the non-backdoored file. Since both files share a hash, this change won't trigger recompilation, giving users a false sense of security when applying the patch.

This is a fairly convoluted attack vector, and I'm making no claims as to its practicality, but I wanted to raise it as an example of how using MD5 here may offer less security than not using any hash function at all.

QuinnWilton avatar Oct 04 '22 19:10 QuinnWilton

blake3 is really fast + can parallize. Noticed a 60x speedup on 8core vs sha256 (i guess divide by 8 to remove the multithreaded advantage if that somehow should be discarded in the calculation).

vans163 avatar Oct 05 '22 00:10 vans163

Thank you for the insights @QuinnWilton! Given that external resources do not have to be source code, I agree with your concerns.

We have three options from here I think:

  1. Go forward with a more expensive hash (downside of an expensive hash is that maybe recompiling is faster than the hash)
  2. Make digesting opt-in (and mention it must be used mainly for textual files)
  3. Don't go ahead with this and leave it up for library authors to implement digesting for source files when they find it relevant

josevalim avatar Oct 05 '22 08:10 josevalim

@vans163 we can use anything that is part of Erlang/OTP. I assume that's not the case for blake3, correct?

josevalim avatar Oct 05 '22 08:10 josevalim

BLAKE3 is very fast, but it's still a relatively new construction, and lack of standardization means it isn't natively supported in many places right now. I do see support for BLAKE2 in Erlang/OTP, which is conceptually very similar to BLAKE3, though slightly slower. For reference, the Linux kernel recently switched from SHA1 to BLAKE2 for entropy mixing in its CSPRNG, and saw large performance improvements.

The other two options are SHA2 and SHA3, which are both decent options here, but with SHA3 also protecting against length extension attacks.

QuinnWilton avatar Oct 05 '22 13:10 QuinnWilton

I haven't tested it myself, but I also read online [1] that BLAKE2 is faster than MD5 on most (64-bit) systems. In terms of security, I think that BLAKE2 is "sufficiently safe" as it's comparable with SHA-2 [2]?

[1] https://medium.com/asecuritysite-when-bob-met-alice/blake3-3716708235ac [2] https://leastauthority.com/blog/blake2-harder-better-faster-stronger-than-md5/

dvic avatar Oct 05 '22 14:10 dvic

@vans163 we can use anything that is part of Erlang/OTP. I assume that's not the case for blake3, correct?

It seems it has to be included in openssl first, then itl be a quick journey into upstream OTP https://github.com/openssl/openssl/issues/11613 :(

BLAKE3 is very fast, but it's still a relatively new construction, and lack of standardization means it isn't natively supported in many places right now. I do see support for BLAKE2 in Erlang/OTP, which is conceptually very similar to BLAKE3, though slightly slower. For reference, the Linux kernel recently switched from SHA1 to BLAKE2 for entropy mixing in its CSPRNG, and saw large performance improvements.

If blake2 is parallelizable (to a degree) + cryptographic hash I think that is perfect?

https://www.linuxadictos.com/en/blake3-una-funcion-hash-criptografica-segura-rapida-y-paralelizable.html

Regarding the main differences between BLAKE3 and BLAKE2:

Using a binary tree structure to achieve unlimited parallelism in the hash calculation.
Reducing the number of rounds from 10 to 7.

vans163 avatar Oct 05 '22 14:10 vans163

Thanks everyone for the feedback, this was an insightful discussion. Here is a PR to move the current MD5 to Blake2: https://github.com/elixir-lang/elixir/pull/12173 - We will see how it goes and then I shall update this PR title accordingly. If it works, it means we can automatically digest with Blake2 (it is faster and safer than MD5).

josevalim avatar Oct 05 '22 14:10 josevalim

Weird idea:

Add extra option :hash to @external_resources that would allow user to define the function that will return hash used for comparing if the file was changed and hash will be used only when this is defined. It will allow for comparing files with defined structure as well. So for example imagine that we would have fast parser for .po files, for example only order of magnitude slower than hashing, then we could mark these files:

File 1:

msgid "Foo"
msgstr "Foo"

msgid "Bar"
msgstr "Bar"

File 2:

msgid "Bar"
msgstr "Bar"

msgid "Foo"
msgstr "Foo"

As the same. We could also support true as an additional option that would use best hash function that we have at given point.

hauleth avatar Nov 04 '22 10:11 hauleth

Yes, I have considered this route too. The biggest question is what to do by default, then we can expand on the rest!

josevalim avatar Nov 04 '22 12:11 josevalim

@josevalim my idea was to use current approach (which I assume is comparing mtime of the BEAM file and the external file).

hauleth avatar Nov 08 '22 14:11 hauleth

It will allow for comparing files with defined structure as well.

I'm not necessarily arguing against this approach, but one risk to be aware of is that it opens the door for very subtle parser differential bugs, where deviations between a parser and a hasher cause files with the same hash to be parsed differently.

How big of a concern that is, I don't know, but parser differentials have led to some of the trickiest bugs I've ever encountered, and so I'm always wary about areas where parsing and signing or hashing behaviour can diverge.

As a concrete example, consider hashing JSON files based on their structure. It becomes possible for both of these files to have the same hash, but while parsing differently:

File 1:

{
  "name": "Quinn",
  "admin": false,
  "admin": true
}

File 2:

{
  "name": "Quinn",
  "admin": true,
  "admin": false
}

QuinnWilton avatar Dec 05 '22 18:12 QuinnWilton