
Support xxHash algorithm

Open · bhagemeier opened this issue · 4 comments

Hi there,

At Juelich Supercomputing Centre, we've recently been researching convenient tools to generate and verify hash sums of large collections of data. The amounts we're typically talking about are in the range of several TB to PB. We've found hashdeep to be convenient, with a good interface that includes parallelisation options, which matter when checksumming and verifying many small files.

We've also come across the xxHash algorithm, which has been designed for very high throughput and is therefore well suited to checksumming extremely large amounts of data.

We have found the command-line tools provided for xxHash to lack some functionality offered by hashdeep. We therefore propose integrating xxHash into hashdeep to improve support for use cases dealing with extremely large volumes of data (see the sketch below). Moreover, we also support the idea of integrating Blake3, as mentioned in #397.
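Not a patch, just a minimal sketch of what such an integration could build on: xxHash's streaming API (shown with XXH64; XXH3 follows the same create/reset/update/digest pattern), which maps naturally onto the chunked file reading hashdeep already does for its other algorithms. The file handling and names here are our own illustration, not hashdeep code; it assumes libxxhash is installed and compiles with `cc -O2 xxh64_file.c -lxxhash`.

```c
#include <stdio.h>
#include <stdlib.h>
#include <xxhash.h>   /* libxxhash, https://github.com/Cyan4973/xxHash */

/* Hash one file with the XXH64 streaming API, reading fixed-size
 * chunks. Returns 0 on success and writes the digest to *out. */
static int xxh64_file(const char *path, XXH64_hash_t *out)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    XXH64_state_t *state = XXH64_createState();
    if (!state) {
        fclose(f);
        return -1;
    }
    XXH64_reset(state, 0);              /* seed 0, as xxhsum uses by default */

    char buf[1 << 16];                  /* 64 KiB read buffer */
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        XXH64_update(state, buf, n);

    int read_error = ferror(f);
    *out = XXH64_digest(state);
    XXH64_freeState(state);
    fclose(f);
    return read_error ? -1 : 0;
}

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++) {
        XXH64_hash_t h;
        if (xxh64_file(argv[i], &h) != 0) {
            perror(argv[i]);
            return EXIT_FAILURE;
        }
        printf("%016llx  %s\n", (unsigned long long)h, argv[i]);
    }
    return EXIT_SUCCESS;
}
```

The same loop structure should slot into hashdeep's per-algorithm init/update/finalize hooks, which is part of why we expect the integration effort to be modest.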

In the spirit of Open Source, we do offer our full support in doing the integration ourselves, but would like to learn about your willingness to include the code in the main branch afterwards. Additionally, if there were good reasons to omit algorithms such as xxHash or Blake3, please let us know about them.

In order to support our request with numbers, here is a comparison of xxHash against the various algorithms supported in hashdeep, run over a 155 GB data set consisting of two files.

| Tool | Duration | Speed (approx.) |
| --- | --- | --- |
| xxHash | 36 s | 4.3 GB/s |
| hashdeep (default: md5 and sha256) | 564 s | 275 MB/s |
| hashdeep (md5) | 184 s | 840 MB/s |
| hashdeep (sha1) | 294 s | 530 MB/s |
| hashdeep (tiger) | 272 s | 570 MB/s |
| hashdeep (whirlpool) | 789 s | 200 MB/s |
| hashdeep (mmap, md5, sha256) | 629 s | 250 MB/s |

As you can see, xxHash is at least 5 times faster than the fastest algorithm supported by hashdeep (4.3 GB/s versus roughly 840 MB/s for MD5).

bhagemeier · Oct 20 '22 09:10