Hash of a whole drive

Open larytet opened this issue 7 years ago • 7 comments

My goal is to hash all files on an HDD in the 0.5-1 TB range. What is my bottleneck going to be: CPU or I/O? Does it make sense to read and hash the physical sectors on the hard disk, and map the hashes to the files at the end of the process using a tool like debugfs?

If my drive is a high-end SSD, does that change the equation?
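
One rough way to approach the CPU-versus-I/O question is to measure both rates on the target machine and compare them. The sketch below is only an illustration, not part of hashdeep: the function names `hash_throughput_mb_s` and `read_throughput_mb_s` are made up here, and the read figure will be inflated if the file is already in the OS page cache.

```python
import hashlib
import time

MIB = 1024 * 1024


def hash_throughput_mb_s(algorithm="sha256", size_mb=64):
    """Hash size_mb MiB of in-memory zeros and return MiB/s (CPU side only)."""
    buf = bytes(MIB)
    h = hashlib.new(algorithm)
    start = time.perf_counter()
    for _ in range(size_mb):
        h.update(buf)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return size_mb / elapsed


def read_throughput_mb_s(path, chunk_size=MIB):
    """Read a file sequentially without hashing and return MiB/s (I/O side only)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total / MIB / elapsed
```

If the hashing rate for the chosen algorithm is well above the read rate of the drive, the job is likely I/O bound; if it is below, a faster hash or more cores may help.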

Thanks

larytet avatar Jul 17 '17 04:07 larytet

Found this https://crypto.stackexchange.com/questions/46469/is-hashing-large-files-cpu-or-i-o-bound

larytet avatar Jul 17 '17 05:07 larytet

On Linux there is debugfs (https://linux.die.net/man/8/debugfs). I can read the drive sector by sector, map sectors to files, feed the SHA machines and, eventually, get a SHA for every file on the disk without doing open-read-close. Or so it appears. What am I missing?
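
As a minimal sketch of the mapping step, debugfs has a `blocks` request that prints the block numbers used by a file, and `-R` runs a single request against a device. The helper names `parse_block_list` and `file_blocks` below are hypothetical, the device must hold an ext2/3/4 filesystem, and reading it usually requires root. This only shows how a file-to-block map could be obtained, not a full sector-level hashing pipeline.

```python
import subprocess
from typing import List


def parse_block_list(output: str) -> List[int]:
    """Parse the whitespace-separated block numbers printed by `blocks`."""
    return [int(tok) for tok in output.split() if tok.isdigit()]


def file_blocks(device: str, fs_path: str) -> List[int]:
    """Ask debugfs which filesystem blocks hold fs_path.

    fs_path is relative to the root of the filesystem on `device`,
    not to the mount point seen by the running system.
    """
    result = subprocess.run(
        ["debugfs", "-R", f"blocks {fs_path}", device],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_block_list(result.stdout)
```

A call such as `file_blocks("/dev/sdb1", "/etc/hostname")` (both values are placeholders) would return the block list for that one file. Doing this for every file is itself a cost that would need to be weighed against plain open-read-close.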

What about Windows?

larytet avatar Jul 17 '17 13:07 larytet

My guess is that you will be I/O bound. This is a WAG, however, and not based on any information specific to your system. I also believe, though I have no evidence to support it, that you will spend more time writing and debugging a system that reads sector by sector and then reconstructs files than you would spend just reading the files the regular way.
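
For reference, "the regular way" can be sketched in a few lines. This is an illustration using Python's standard `hashlib`, not how hashdeep itself is implemented; `hash_file` and `hash_tree` are names invented for the example.

```python
import hashlib
import os


def hash_file(path, algorithm="md5", chunk_size=1024 * 1024):
    """Hash one file in fixed-size chunks so memory use stays constant."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hash_tree(root, algorithm="md5"):
    """Yield (path, hex digest) for every regular file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks, sockets, device nodes and similar entries.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                yield path, hash_file(path, algorithm)
            except OSError:
                # Unreadable file (permissions, vanished mid-walk): skip it.
                continue
```

Something like `for path, digest in hash_tree("/mnt/data"): print(digest, path)` would walk a mounted drive; the mount point here is a placeholder.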

jessek avatar Jul 18 '17 15:07 jessek

The goal is to run on hundreds of thousands of machines and VMs. In my case performance is critical; development effort is not.

larytet avatar Jul 18 '17 17:07 larytet

If you have the time, you're welcome to go for it. Please let me know how it goes!

jessek avatar Jul 19 '17 02:07 jessek

@jessek I also have to hash whole drives a lot on Linux, like three 3.7 TB drives full of mixed types of data... which takes a really long time with md5.

How about implementing a super-fast algorithm such as xxHash, for cases where the goal is purely to check data integrity?

keybreak avatar May 02 '19 20:05 keybreak

If you are CPU bound, you may want to look at xxhash.

xxhash is probably among the fastest non-cryptographic hash algorithms available today. Combined with the file size, a 64-bit hash is more than enough for non-cryptographic purposes.
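
A minimal sketch of the "file size plus 64-bit hash" idea follows. It assumes the third-party Python `xxhash` package is installed; if it is not, the code falls back to an 8-byte BLAKE2b digest purely as a stand-in so the example still runs. The name `integrity_key` is invented for this example.

```python
import hashlib
import os

try:
    import xxhash  # third-party package; assumed to be installed

    def _new_hasher():
        return xxhash.xxh64()
except ImportError:
    def _new_hasher():
        # Stand-in only: a 64-bit BLAKE2b digest, not xxHash.
        return hashlib.blake2b(digest_size=8)


def integrity_key(path, chunk_size=1024 * 1024):
    """Return (file size in bytes, 64-bit hash as 16 hex characters)."""
    h = _new_hasher()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return os.path.getsize(path), h.hexdigest()
```

Two files whose keys differ are certainly different; two files with equal keys are very likely identical, which is enough for detecting accidental corruption but not for resisting deliberate tampering.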

HaleTom avatar Sep 17 '19 07:09 HaleTom