
Support hashings other than MD5

Open adrinjalali opened this issue 5 years ago • 50 comments

The docs seem to imply that md5 is the only hash users can use. The vulnerabilities of md5 have made it not usable in many organizations which require a certain level of security.

Is there a way to use SHA-1, SHA-256, or other hashes instead?

I saw some PRs (https://github.com/iterative/dvc/pull/2022 for instance), but they're closed w/o being merged. What's the progress on that front?

adrinjalali avatar Jan 06 '20 15:01 adrinjalali

Hi @adrinjalali !

In our case md5 is not used for cryptography, it is simply used for file hashing, where it is still suitable (at least I'm not aware of any issues with it in such scenarios), so it is safe to use for that purpose. The contents of the cache files are not encrypted, you could see it for yourself by opening any file under .dvc/cache. If you need to encrypt your data, you could do that pretty easily on the remote itself (e.g. the sse option for the s3 remote) or encrypt in your pipeline along the way and only track the encrypted files. We also have https://github.com/iterative/dvc/issues/1317 for in-house encryption, please feel free to leave a comment there, so it is easier for us to estimate the priority. Would cloud-side encryption be suitable for your scenario?

efiop avatar Jan 06 '20 18:01 efiop

I understand these hashes shouldn't be the only defense layer against a malicious player, but they are still very useful. Imagine a malicious player who has write access to the storage (an s3 bucket in this case). They can take a model file, modify it maliciously in a way that keeps the md5 unchanged, and upload the new one. The rest of the pipeline won't detect a change in the file, and will serve the wrong model.
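The check being attacked in this scenario can be sketched in a few lines. This is an illustrative Python version of content-addressed verification, not DVC's actual implementation:

```python
import hashlib

def verify(path, expected_md5):
    """Recompute a file's MD5 and compare it to the recorded value --
    the only integrity check a purely content-addressed store performs."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected_md5
```

Because MD5 collisions are cheap to construct, an attacker with write access can replace the file with a colliding one and a check like this still passes.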

Again, I understand this should not be seen as the only way to secure the models, but in places where software goes through a security audit before being used, md5 is one of the first things they're gonna use to reject it.

Another point is that supporting a SHA is not expensive or complicated at all, so why not include it?

adrinjalali avatar Jan 07 '20 12:01 adrinjalali

I agree with @adrinjalali, using MD5 to check for integrity is far from ideal.

As SHA-1 is starting to show collisions as well, DVC could follow Git and replace the hashing function with SHA-256.

JavierLuna avatar Jan 07 '20 19:01 JavierLuna

@adrinjalali @JavierLuna Thanks for the info, guys! We will support other hash functions eventually, it just wasn't the focus for us, but we will reconsider it. We've had an attempt from an outside contributor (https://github.com/iterative/dvc/pull/1920), but he wasn't able to finalise it :slightly_frowning_face: If anyone is interested in contributing this feature (similar to #1920, where sha is an opt-in config option for now), we will be happy to help with everything we can.

I'll leave this issue open in addition to https://github.com/iterative/dvc/issues/1676, as this current one is about simply replacing md5 with something else (possibly as an option), while #1676 is originally about custom hash functions.

efiop avatar Jan 07 '20 20:01 efiop

I'd like to help with both issues! Maybe I'll try working on #1676 and set the default hashing function to sha-256?

JavierLuna avatar Jan 07 '20 21:01 JavierLuna

@JavierLuna That would be great! :pray:

Maybe I'll try working on #1676

Let's start with the current one, similar to how #1920 does it. Custom hashes could be implemented later, especially since we don't really know what type of hash the creator of #1676 wants.

set the default hashing function to sha-256

Please don't set it as the default, it should be configurable. MD5 should stay the default for now. If the feature is implemented correctly, we will be able to switch to sha-256 later with a 1-line PR :slightly_smiling_face:

Please let us know if you have any questions :) FYI: we also have a #dev-general channel in our discord, feel free to join :slightly_smiling_face:

efiop avatar Jan 07 '20 22:01 efiop

Also, we need to remember that changing the default checksum will result in whole-cache recalculation for users upgrading their DVC version, and this could cause a lot of issues. If keeping md5 is a severe problem, the change of default checksum should probably be left for a major version release.

pared avatar Jan 08 '20 10:01 pared

It might be interesting to use xxhash: https://cyan4973.github.io/xxHash/ https://pypi.org/project/xxhash/

which is quite a bit faster than md5 (although like md5 it is not cryptographic, but that's fine for the dvc use case).

As a reference, I have a file util_hash in my ubelt library that supports generalized hashing. I may take a stab at integrating xxhash in a future PR.

Erotemic avatar Sep 29 '20 14:09 Erotemic

I'm currently running a VERY long dvc add, and I wanted to ping again to add some steam to getting xxhash incorporated into dvc.

XXhash is quite a bit faster. I've done some benchmarks:

https://github.com/Erotemic/ubelt/blob/master/dev/bench_hash_file.py

You can run these yourself to verify on your hardware, but this is what I got on mine:

    xxHash is much faster than sha1, bringing computation time down from .57
    seconds to .12 seconds for a 387M file.

If the pattern holds, that means switching to the xxh64 hasher would give a 4.75x speedup when adding files.
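xxhash is a third-party package, but the flavor of this comparison is easy to reproduce with the hashes that ship in Python's hashlib. A rough sketch (timings are machine-dependent; the ranking, not the absolute numbers, is the point):

```python
import hashlib
import time

data = b"\x00" * (64 * 1024 * 1024)  # 64 MiB stand-in for a data file

for name in ("md5", "sha1", "sha256", "blake2b"):
    h = hashlib.new(name)
    start = time.perf_counter()
    h.update(data)
    h.hexdigest()
    print(f"{name:8s} {time.perf_counter() - start:.3f}s")
```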

Erotemic avatar Jan 23 '21 20:01 Erotemic

DVC should only rely on cryptographic hash functions that are collision resistant, which would exclude md5 and xxhash. Otherwise it's easy to get the wrong file from a hash.

jtrakk avatar Jan 23 '21 20:01 jtrakk

Don't confuse collision resistance with cryptographic security. DVC does not need a cryptographic hash function, but it does need a collision resistant one.

Many if not most datasets are small enough that I would expect xxh64 to be collision resistant enough. Either way, from a design perspective the hashing function should be configurable if/when we find better hashing algorithms.

AFAIK it is not necessary for a hash function to be cryptographic for it to be collision resistant.

https://en.wikipedia.org/wiki/Collision_resistance

For instance, just because it is easy to find data that would collide doesn't necessarily mean it's likely for those collisions to be valid files.

How collision resistant is xxhash? This thread has some insights:

https://github.com/Cyan4973/xxHash/issues/165

Erotemic avatar Jan 23 '21 21:01 Erotemic

If dvc tells me I got "data.csv with checksum 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4", I want to be confident that I have the right file, not one that's forged or unlucky.

wiki lists some features to consider.

The ideal cryptographic hash function has the following main properties:

  • it is deterministic, meaning that the same message always results in the same hash
  • it is quick to compute the hash value for any given message
  • it is infeasible to generate a message that yields a given hash value (i.e. to reverse the process that generated the given hash value)
  • it is infeasible to find two different messages with the same hash value
  • a small change to a message should change the hash value so extensively that the new hash value appears uncorrelated with the old hash value (avalanche effect)

I think all of those properties are desirable in DVC. There are fast and strong hash functions available for that purpose and DVC can use them. Some are even in the Python standard library.
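For instance, every CPython build guarantees a set of strong hashes out of the box. A quick check using the documented hashlib.algorithms_guaranteed attribute:

```python
import hashlib

# Algorithms every CPython build must provide -- no third-party deps needed.
print(sorted(hashlib.algorithms_guaranteed))

# BLAKE2 is cryptographic and typically faster than MD5 on modern CPUs.
print(hashlib.blake2b(b"data.csv contents", digest_size=32).hexdigest())
```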

jtrakk avatar Jan 24 '21 00:01 jtrakk

Right, dvc only needs some and not all of those properties. Again don't confuse collision resistance with cryptographic security.

  • it is deterministic, meaning that the same message always results in the same hash

xxhash has this

  • it is quick to compute the hash value for any given message

xxhash is better at this

  • it is infeasible to generate a message that yields a given hash value (i.e. to reverse the process that generated the given hash value)

most dvc use cases do not care about this property.

  • it is infeasible to find two different messages with the same hash value

most dvc use cases do not care about this property.

  • a small change to a message should change the hash value so extensively that the new hash value appears uncorrelated with the old hash value (avalanche effect)

xxhash and many other non-cryptographic hashing functions have this property.

What is not listed here is that any dvc hash function should have the property that for any two distinct files A and B, the collision probability is low: p(h(A) = h(B)) < threshold. And xxhash, md5, and sha256 all have this property. Perhaps xxhash will give collisions on HUGE amounts of data, but the normal terabyte use case will be fine.
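The "threshold" here can be estimated with the standard birthday bound. A back-of-the-envelope sketch, using the usual approximation (valid only while the result is small):

```python
def collision_probability(n_files: int, hash_bits: int) -> float:
    """Birthday-bound approximation: p ~= n^2 / 2^(b+1) for n items
    hashed to b-bit values (accurate only while p is small)."""
    return n_files ** 2 / 2.0 ** (hash_bits + 1)

# Among one billion tracked files:
print(collision_probability(10**9, 64))   # 64-bit output (e.g. xxh64): ~3%, borderline
print(collision_probability(10**9, 128))  # 128-bit output (md5-sized): negligible
```

This is about accidental collisions only; it says nothing about an adversary deliberately constructing them, which is where cryptographic strength matters.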

If you are worried about forged files, then yes, you should configure dvc to use a secure hashing function (like sha256, and certainly not md5). So as it stands dvc is vulnerable because md5 is vulnerable. But I don't care about forged files. Are you worried about adversaries? If you are, consider signing your data with gpg keys. That will be more secure.

Erotemic avatar Jan 24 '21 00:01 Erotemic

Adversaries are a thing, which makes DVC as it is now vulnerable. I would love to be able to recommend DVC in teams and orgs I work with, but as it is now, I can't. Many organizations have a real need for their tools to be secure, and the hash issue in DVC is something security people notice quickly.

I think the moral of this issue is that there are quite a few users who are requesting this feature, so it would be nice if it were to be prioritized 🤷

adrinjalali avatar Jan 24 '21 08:01 adrinjalali

This issue is on our roadmap and will likely be addressed at the same time as https://github.com/iterative/dvc/issues/829. When we have support for chunking in DVC, we will move away from MD5 for identifying chunks, and will use some other cryptographically secure hash. SHA256 is one option, but we are also considering faster options (like the Blake variants).

pmrowla avatar Jan 24 '21 09:01 pmrowla

@adrinjalali What is the threat model that DVC is realistically insecure against?

If an organization is hosting a DVC cache on its internal servers, then an adversary would need to gain access to those servers before they could forge a potentially malicious file that has the same hash as an expected file. If the adversary has access to your internal servers, your organization has bigger problems. I suppose this is more of a real issue for very large corporations where insider threats are more of an issue, but even so, controlling write permissions on the cache should mitigate most concerns.

Is there an attack or threat model that I'm not considering?

@pmrowla I hope that whatever hashing algorithms are being discussed are just for the defaults and in general the hashing algorithm will be configurable. In use cases like mine where everything is behind protected servers, I really don't care about having a cryptographically secure hashing algorithm. I just want the fastest one with a collision probability less than an epsilon threshold.

Erotemic avatar Jan 25 '21 15:01 Erotemic

Yes, if the adversary has access to the servers, that's a big issue. But this is about having multiple levels of security in your system, instead of only relying on your network security. Each component gets audited separately on its attack surface, and MD5 (or any other weak hash) is just a red flag raised by auditors.

This is about an adversary being able to change something w/o it being easily noticed. And a change that keeps the hash the same is really hard to detect if the hash is weak.

adrinjalali avatar Jan 25 '21 16:01 adrinjalali

I would like to add that most government systems require FIPS 140-2 crypto. All insecure hashes are excluded (including MD5), as well as some secure hashes that are not "blessed" by NIST. The SHA-2 family of hashes is included in FIPS 140-2. For DVC to have a chance of running on those systems you need to at least support a hash that is in the FIPS 140-2 list of approved ciphers. It does not matter to the auditors how you use the hash (e.g., if you don't care about preimage resistance), you still will not pass the audit. Furthermore, MD5 (and others like xxhash) are simply not implemented in OpenSSL in FIPS 140-2 mode and thus might not have a verified and efficient implementation available to Python.

ktarplee avatar Jan 26 '21 10:01 ktarplee

@ktarplee Great points!

efiop avatar Jan 26 '21 12:01 efiop

Furthermore, MD5 (and others like xxhash) are simply not implemented in OpenSSL in FIPS 140-2 mode and thus might not have a verified and efficient implementation available to Python.

@ktarplee xxhash has a super efficient python implementation: https://github.com/ifduyue/python-xxhash-cffi That's how I got the times I cited in the above benchmark. Perhaps you meant are not available to Python as a NIST-compliant hash? In that case yes, and I would want to reiterate that nobody should be using xxhash or md5 if they actually need a cryptographic hashing algorithm.

It does not matter to the auditors how you use the hash ... you still will not pass the audit.

I think this is understood by people in this conversation, but for readers I want to explicitly differentiate "passing an audit" from "causing issues in real world applications". If you will fail an audit just because you use a non-cryptographic hash at any point in the program, that is a problem with the audit methodology and not the actual security of the stack. Yes, there is something to be said for "multiple layers of security", but in reality the government requirements discussed here amount to nothing more than lazy auditing, where they can throw a red flag because they matched some criterion on a checklist. This sort of audit does not consider the context in which the non-cryptographic hash is used and will throw any number of false alarms to maximize the number of real security threats that are found.

Under these lazy audit criteria, you shouldn't even be allowed to use Python dictionaries, because they use a non-cryptographic hashing scheme:

https://github.com/python/cpython/blob/63298930fb531ba2bb4f23bc3b915dbf1e17e9e1/Python/pyhash.c#L39

And yes, there are even security implications to this: https://static.usenix.org/event/sec03/tech/full_papers/crosby/crosby_html/ (some of which listed here have been mitigated).

But most government projects don't disallow the use of Python. Instead they simply compartmentalize the security risk such that an adversary simply doesn't have access to it.

Note that the use-case in DVC is a little different. Security of the hashes is a concern if you were to host a dvc cache publicly. But I balk at the idea that something would fail an audit just because it uses a non-crypto hash, without considering the context in which that non-crypto hash is used. My goal in writing this is (1) to make sure people understand there is a bit more nuance to situations that really require a cryptographically secure hash and (2) to make sure the DVC devs don't force us to use a crypto hash (although a crypto hash should definitely be the default!) and will let us configure DVC to use something faster like xxhash if our use-case and threat model allows for it.

Erotemic avatar Jan 26 '21 16:01 Erotemic

(2) to make sure the DVC dev's don't force us to use a crypto hash (although a crypto hash should definitely be the default!) and will let us configure DVC to use something faster like xxhash if our use-case and threat model allows for it.

Often when a tool allows users to configure it for a lenient environment, the low-threat configuration gets propagated into higher-threat environments. Configurability can be a problem when it can get non-experts into trouble. For example, a user googles "why does hashing take a long time?" and copies the answer from Stack Overflow without consulting a security expert. Configurability increases the dimensionality of a situation, sometimes enough even to trip up experts. For example, high-security user Alice accepts files from low-security user Bob, who's been using insecure protocols, thereby compromising the integrity of Alice's data.

FWIW my own informal timings:

# ~1GB data file
$ for prog in sha256sum md5sum b2sum xxh64sum b3sum; do echo $prog ; (command time -f %e $prog data >/dev/null) ; echo; done
sha256sum
6.58

md5sum # insecure
2.64

b2sum
2.10

xxh64sum # insecure
0.22

b3sum
0.06

jtrakk avatar Jan 26 '21 19:01 jtrakk

@Erotemic @ktarplee @jtrakk an excellent discussion guys, very interesting to read!

I find this link mentioned above very useful and the way Git handles the migration to SHA-256 - https://stackoverflow.com/questions/28159071/why-doesnt-git-use-more-modern-sha/47838703#47838703

It also feels like there is no one-size-fits-all solution? Or even if we agree on a hash we'll use in DVC 3.0, we should probably abstract it out like Git does. And/or allow users to plug in their own implementation after all.

shcheklein avatar Jan 27 '21 02:01 shcheklein

If you are going through the process of parameterizing the hash algo, you still ought to enforce some set of vetted algorithms/implementations. Compared to the difficulty of preimage/collision attacks, it's far too easy to compromise the configuration which sets the hash function, or to tweak the metadata so that it accepts a weaker hash.

MD5 is the weakest of the algorithms available in hashlib, but it will likely be supported by DVC for a long time, being the current default. So imagine that once the default is upgraded to SHA256, a .dvc file is encountered with a valid-looking md5 checksum. Naively, the software will parse - md5: 699dc9e031aaf17f19b4e7ccff345002, select md5 from hashlib, and verify. All looks A-OK. But this is a recent codebase using SHA256 by default, and in fact both the source file and the .dvc metadata are malicious. There should be a config which enumerates the allowed algorithms.
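A minimal sketch of the allow-list check being proposed - the config source and names here are hypothetical, not DVC's actual API:

```python
import hashlib

# Hypothetical allow-list, e.g. read from .dvc/config; names are illustrative.
ALLOWED_ALGORITHMS = {"sha256", "blake2b"}

def checked_hasher(name: str):
    """Refuse algorithms outside the allow-list, so a crafted .dvc file
    can't silently downgrade verification to md5."""
    if name not in ALLOWED_ALGORITHMS:
        raise ValueError(
            f"hash algorithm {name!r} is not allowed "
            f"(allowed: {sorted(ALLOWED_ALGORITHMS)})"
        )
    return hashlib.new(name)
```

With this in place, checked_hasher("sha256") returns a hasher while checked_hasher("md5") raises, even though hashlib itself would happily accept it.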

Re: xxHash - that's neat, but if you are looking for performance, blake3 provides arguably "good enough for DVC purposes" speed (~10x MD5 and SHA256), with the huge advantage of being an actual cryptographic hash from the BLAKE family, which has a lot of history behind it (itself based on ChaCha). Blake3/Bao also supports a streaming protocol, which, while currently in its infancy, would be extremely useful for incremental hashing of large files: a) you get a huge speedup when two files diverge, and b) you could conceivably have an append-only file format and de-duplicate earlier blocks.

xkortex avatar Apr 27 '21 15:04 xkortex

My 2c:

  • IMO a configurable hash algorithm opens a whole can of worms in terms of security, maintenance, and documentation. It can be supported architecture-wise and the algorithm could be updated with a simple patch, but having an option to change this at the user level is not good practice. It's a decision of indecision. Even MD5 should not be available for new repositories.

  • Using the same algorithm as Git has some benefits, like copying small experiment artifacts to Git objects as-is. Also, we can be sure that as long as Git is maintained, the hash algorithm will be updated and secure.

iesahin avatar Apr 27 '21 17:04 iesahin

I agree that the fewer degrees of freedom, the less complexity there is. However, I still see two different points of parameterization - within the DVC codebase, and user-configurable. The former just implies it would be straightforward to swap out the hash algo used by the codebase - i.e. minimize reliance on hard-coded values. I think this ought to happen regardless, as MD5 should be sunsetted. Replacing MD5 with SHA256 still implies the ability of DVC to read a metadata checksum name and value and validate it.

It's a second discussion whether to expose that ability to user-configuration.

Personally, I don't have that much trouble with exposing this as a config. The least-effort route is to just accept anything in algorithms_guaranteed and pass it to hashlib.new - {'blake2b', 'sha3_256', 'blake2s', 'shake_128', 'sha3_384', 'sha384', 'sha512', 'sha3_224', 'sha1', 'md5', 'shake_256', 'sha256', 'sha3_512', 'sha224'} on 3.6, the minimum version supported by DVC. Pretty much offload the maintenance and security burden to the Python org XD.

Catch the ValueError: unsupported hash type foobar and make it slightly more user-friendly and you're done. Basically just something like:


import hashlib

try:
    hasher = hashlib.new(name)  # "name" comes from the user's config
except ValueError:
    raise DVCConfigError(
        '"{}" is not a valid hash algorithm name. '
        'Acceptable choices are {}'.format(name, hashlib.algorithms_guaranteed)
    )  # or some whitelist

Documentation: "DVC default_hash can be configured to be one of the following: ..."

Security: it's already using MD5. Hard to go worse from there (if ensuring usedforsecurity=True).

Allowing pluggable hash algos, e.g. to support Blake3, yeah I can see that being a can of worms.

xkortex avatar Apr 28 '21 00:04 xkortex

Let me give an example:

Suppose Alice and Bob share a remote. Alice configures to use Blake3, Bob wants to use SHA256.

Now, we don't have any metadata in remotes that shows which hash algorithm is used. They are simple directory trees with a deterministic structure. How will we solve this? We'd need to keep metadata in remotes. That means all remote and cache-related code would have to be updated just to support this.

Alice has a local cache in Blake3 and a remote shared with Bob in SHA512; Bob has a local cache in SHA256. When one of them wants to send a file to the other, a conversion is needed - where, and why?

Their remote is on a Windows server, and suppose they put it in C:\Very\Very\...\Deep\Dir\ and guess what - they discover that Windows has a restriction of 1024 characters for paths. 😄 Or their directory is exactly 510 characters deep and it works for normal files, but fails for .dir files and we can't understand why.

All of DVC's data sharing and versioning code needs an update, and all these algorithms will need separate tests. We'll have to ask "which hash algorithm do you use?" in new tickets. We'll replace each "MD5" occurrence with at least a few sentences in the docs. If some of these algorithms have export restrictions to some countries, DVC will have to maintain separate branches/controls/whatever for each, because once you start to support n algorithms, you just can't go back and say err, we actually didn't mean this one.

The hashing part should be abstracted away, and if a new algorithm appears that is just better for DVC, we should be able to use it without much change in the codebase. I also fully support forks modifying this layer for their own use, but allowing users to change the algorithm in .dvc/config is not something I'd dare.

Thank you.

iesahin avatar Apr 28 '21 01:04 iesahin

The choice of MD5 is curious. Git is built upon the (also broken, but, alas, far, far more secure) SHA-1. Why choose a hash that has been known to be broken from day 0? One could start to be suspicious.

The blasé urgency with which this issue has been treated here again verges on suspicious.

We should upgrade to Blake 3 as a matter of security urgency.

MD5 is a security critical failure and should be addressed as such!

--

CC: @josecelano, @yeraydavidrodriguez

da2ce7 avatar Oct 03 '21 09:10 da2ce7

@iesahin

Let me give an example:

Suppose Alice and Bob share a remote. Alice configures to use Blake3, Bob wants to use SHA256.

We need to stop having a completely critical security vulnerability as a matter of urgency. Supporting multiple hashes would be nice, but that is a far secondary concern to moving away from MD5.

We need to pick any non-broken hash and do a one-way move as a matter of desperate urgency. Later we can think about whether we want support for multiple hashing algorithms. That is a different topic from addressing this security vulnerability.

All of DVC's data sharing and versioning code needs an update, and all these algorithms will need separate tests. We'll have to ask "which hash algorithm do you use?" in new tickets. We'll replace each "MD5" occurrence with at least a few sentences in the docs. If some of these algorithms have export restrictions to some countries, DVC will have to maintain separate branches/controls/whatever for each, because once you start to support n algorithms, you just can't go back and say err, we actually didn't mean this one.

You are simply wrong here. There are no practical export restrictions on crypto in this world anymore. This fear is about 20 years out of date. Remember when the wonderful people published the PGP physical book, with the source code printed inside?

da2ce7 avatar Oct 03 '21 09:10 da2ce7

We need to stop having a completely critical security vulnerability as a matter of urgency. Supporting multiple hashes would be nice, but that is a far secondary concern to moving away from MD5.

I agree completely.

There are no practical export restrictions on crypto in this world anymore. This fear is about 20 years out of date.

Laws change. That was just an example, though. More algorithms mean more points to consider; more options require more updates and more maintenance in the future.

iesahin avatar Oct 03 '21 09:10 iesahin

Guys, we've been using md5 for historical reasons. Initially we only had pipeline management, where md5 was used as a checksum and not a hash, so it was perfectly suitable for the task. Data management grew out of it and kept using md5. We are well aware that md5 is not a great hash, and we will be switching to something more suitable (likely some SHA) in 3.0, so stay tuned.

efiop avatar Oct 03 '21 10:10 efiop