Update BLAKE3 to the latest version
Greetings dear @idrassi
You forked the amazing HashCheck Shell Extension, implemented BLAKE3 and other nice stuff, and released version 2.5.0 in 2021.
The latest version of BLAKE3 is 1.5.5 from 2024-11-26. Since 2021, BLAKE3 has had a lot of improvements and fixes. https://github.com/BLAKE3-team/BLAKE3
Please, can you update BLAKE3 inside HashCheck Shell Extension?
Thanks for being awesome and a great developer. Have a very nice day.
Hi!
Version 1.7.0 of BLAKE3 has recently been published. It'd be great if it could be integrated into the program.
Thanks in advance!
Last week, v1.8.1 was released. I really hope HashCheck will be updated with the latest BLAKE3. https://github.com/BLAKE3-team/BLAKE3/tree/1.8.1 Also, fixing the sort issue in checksum files would be nice, but this is secondary.
Hello @idrassi Please consider updating BLAKE3, since it supports multithreading as of 1.7.0. It is considerably faster multithreaded, and this will unfortunately probably be my reason for switching to a different solution if it is not implemented. Nothing has beaten your fork for simple BLAKE3 checksums with a clean UI. If not, I really hope a BLAKE3 update makes its way into your DirHash software, ideally with a front-end similar to HashCheck's eventually. Thanks!
@redactedscribe I have integrated the latest version of BLAKE3 and I did tests with the new TBB feature that is supposed to enhance performance: when hashing a large file (6 GiB), CPU usage goes from 4% to 85% which is expected with multithreading but it takes the same amount of time to perform the hash.
I checked everything, and it is the new BLAKE3 function `blake3_hasher_update_tbb` (which implements TBB) that takes the same amount of time while consuming all available cores.
HashCheck's implementation is standard: it processes the file in chunks of 256 KiB. These chunks are processed sequentially (which is mandatory for hashing), and each chunk is fed to the function `blake3_hasher_update_tbb`. This function is supposed to process the 256 KiB faster since it uses all CPU cores, but in my tests it doesn't, and it somehow blocks.
I used the official oneTBB binaries and libs from https://github.com/uxlfoundation/oneTBB and I followed the documented integration approach: define `BLAKE3_USE_TBB` and disable exception handling when building `blake3_tbb.cpp`.
All these tests were conducted with BLAKE3 as the only active algorithm, so this is not caused by some kind of side effect from other algorithms. I also disabled multithreading in HashCheck just to see if it had any effect on the issue, but the issue was still there.
For me, it is clear that there is an issue in the oneTBB library as used by the BLAKE3 library in `blake3_hasher_update_tbb`, at least on Windows x64 (maybe it is fine on Linux).
I cannot spend more time on this.
I will integrate the new BLAKE3 version in HashCheck without the TBB feature and I will publish a new version.
Concerning DirHash, I doubt there will be any gain but I will check when I have time.
I sincerely appreciate your efforts @idrassi. It's a shame there will be no time gain by updating BLAKE3 in this case. Hopefully someone can figure out the issue you ran into and submit a PR for you to merge in. I may still end up switching to another hashing utility until then. Nonetheless, thank you for the BLAKE3 version bump and not entirely forgetting about this convenient software.
Thanks for your hard work implementing the new version, @idrassi .
@idrassi
First, thank you very much 😄 You're awesome! I read your whole description and I have a question. You said:
> I have integrated the latest version of BLAKE3 and I did tests with the new TBB feature that is supposed to enhance performance: when hashing a large file (6 GiB), CPU usage goes from 4% to 85% which is expected with multithreading but it takes the same amount of time to perform the hash.
Were those tests done with the new version (I think v1.8.2) with and without the TBB feature, or was it the 2021 (HashCheck) version vs. the new version (with TBB)?
Because I don't think the version from 2021 takes the same amount of time to perform hashes as the newest version (from 2025-04), regardless of the TBB issue/bug (the one that makes "CPU usage go from 4% to 85%"). I mean, there should be many improvements to the BLAKE3 algorithm in four years.
> I will integrate the new BLAKE3 version in HashCheck without the TBB feature and I will publish a new version.
And we are very happy about your comments and will be waiting for the new version. Thanks again 🏅
I ran tests with the new version of BLAKE3, both with and without TBB enabled.
In version 1.8.2, there’s no significant change: after reviewing the code, I found that aside from the TBB integration, the main updates are minor assembly optimizations and the addition of enhanced AVX-512 support. Therefore, we shouldn’t expect a noticeable performance boost simply by upgrading the BLAKE3 library, at least, not on my systems, which don’t have AVX-512 instruction support (so I can’t evaluate that aspect).
Here’s what I’ve realized about TBB’s behavior:
TBB only delivers performance benefits when hashing an entire file in one pass, exactly as described in the BLAKE3 documentation. The TBB API comes with substantial overhead during initialization and finalization, so for maximum performance, TBB should be invoked just once, processing the file in a single call to avoid repeated setup/teardown costs.
However, tools like HashCheck (and similar utilities, such as DirHash) process files in chunks, since it’s not always feasible, or efficient, to load large files (multi-GB) entirely into memory. In practice, reading a file in manageable chunks is often faster and more scalable than reading hundreds of megabytes (or more) at once.
This means that for files larger than the chosen chunk size, TBB initialization and finalization are triggered repeatedly, largely negating any performance benefit.
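The arithmetic behind this is easy to sketch. Below is a toy cost model (the function name and all numbers are made up for illustration, not measurements): each parallel update call pays a fixed setup/teardown cost, so chunked hashing pays that cost once per chunk, while single-pass hashing pays it only once.

```python
def total_time(file_size, chunk_size, per_byte, cores, setup_cost):
    """Toy model of hashing time with a per-call fixed overhead.

    Each update call hashes one chunk spread across `cores` cores,
    but pays `setup_cost` seconds of setup/teardown. Illustrative only.
    """
    n_chunks = -(-file_size // chunk_size)  # ceiling division
    per_chunk = setup_cost + (chunk_size * per_byte) / cores
    return n_chunks * per_chunk

# Hypothetical numbers: a 1 GiB file, 1 ns/byte serial hashing speed,
# 8 cores, and 1 ms of setup per parallel call.
GiB = 1 << 30
serial = total_time(GiB, GiB, 1e-9, 1, 0.0)               # one-shot, serial
one_shot_tbb = total_time(GiB, GiB, 1e-9, 8, 1e-3)        # one-shot, parallel
chunked_tbb = total_time(GiB, 256 * 1024, 1e-9, 8, 1e-3)  # 256 KiB chunks
```

With these made-up numbers, one-shot parallel hashing beats serial easily, but chunked parallel hashing ends up slower than plain serial hashing: the thousands of repeated setups dominate, which matches the observation that chunked TBB brings no net gain.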
Currently, HashCheck uses a chunk size of 256 KiB, a compromise chosen to optimize I/O performance across both SSDs and HDDs.
I experimented with increasing the chunk size in HashCheck to see if reducing TBB’s overhead would improve overall performance. For reference, my tests used a 189 GiB file on an NVMe SSD with a Core i9-13900HX CPU.
| Chunk Size | Time (with TBB) | Time (without TBB) |
|---|---|---|
| 256 KiB | 98 seconds | 98 seconds |
| 1 MiB | 71 seconds | 64 seconds |
| 2 MiB | 69 seconds | 61 seconds |
| 5 MiB | 69 seconds | 67 seconds |
| 10 MiB | 93 seconds | 88 seconds |
| 50 MiB | 134 seconds | 123 seconds |
As you can see, there is a noticeable improvement when using chunk sizes between 2 MiB and 5 MiB, but this advantage disappears with larger chunks due to their negative impact on I/O performance. I haven’t tested on HDDs, but I suspect that a 2 MiB chunk size would significantly degrade performance compared to smaller chunks on spinning disks.
The non-TBB version consistently outperforms the TBB-enabled version, primarily due to the overhead incurred for every chunk. However, if a file fits entirely within a single chunk, the TBB version can be 5x to 10x faster than the non-TBB version.
One possible approach would be to increase HashCheck’s default chunk size to 2 MiB. This could deliver a 30% speed improvement on systems with fast SSDs. For older drives, though, the benefits are uncertain and performance could even worsen in some scenarios.
Any thoughts or suggestions on this direction?
Given these findings, there is little practical merit in adding TBB support at this point, since most files will not fit within a single chunk and therefore cannot benefit from TBB’s parallelism. For the vast majority of use cases, where files are processed in smaller segments, the overhead outweighs any theoretical advantage.
The main reason I became aware of BLAKE3's multithreading was a comment from a file manager user:
> multithreading support is now available. ... Dopus [Directory Opus] 13.15.1, 328MB/sec vs Total Commander 11.55 Release-Candidat[e] 2, 1,5GB/sec
This speed increase may be due to more than just the BLAKE3 versions being compared, as the file managers may have implemented them differently. Anyway, this information isn't very helpful without details about their CPU etc., but it implied a significant bump in performance, which presumably Total Commander now benefits from. Unfortunately, Total Commander is closed-source, I believe, and the changelog for its version 11.55 only mentions:
> Create/Verify Checksums: Use multiple threads for Blake3 checksums (64-bit only, on Windows 7 and newer)
There is an 11.55 demo available if, for some reason, a black-box comparison of how it performs on your system's 189 GiB file would be a helpful sanity check.
As for your question:
> Any thoughts or suggestions on this direction?
If your insights are indeed correct, it sounds reasonable to omit TBB support for now, as it's probably not the norm for users to hash large numbers of sub-256 KiB files. However, for users with AVX-512 instruction support, TBB's implementation may actually be beneficial (my CPU also does not support it).
It seems hard-coding 2 MiB would be the best chunk size regardless of whether TBB is used, but it should then be dynamically set to 256 KiB if the files reside on an HDD. In addition, or alternatively, a HashCheck option could let the user set the chunk size manually. The last alternative would be hard-coding the chunk size to 1 MiB and hoping it strikes the best balance without any additional logic or user options. Options would help with future-proofing, but it all depends on your available time.
When you'll next have time to attend to HashCheck is unknown, so any changes that could be made now and pay off later would be welcome. For example: although it is likely a large job, outside the scope of the time you can currently commit, if HashCheck were extensible, the algorithms could be made into plugins. Updating or fixing BLAKE3 and other algorithms could then be done without altering HashCheck itself, letting the community handle the task instead (which I'm guessing is the subject matter of many of the repo's issues).
@redactedscribe I’ve installed Total Commander 11.55 x64 on the same machine and performed a BLAKE3 checksum on the same 189 GiB file: it took exactly 64 seconds.
This matches the performance in HashCheck when using a 1 MiB chunk size without TBB.
Regarding the user's comment you referenced, it's likely they measured performance using a directory filled with many small files that fit within Total Commander's internal chunk size. That’s the only scenario where TBB delivers significantly better performance.
Based on this, I'm considering the following changes:
- Set the default chunk size to 1 MiB.
- Modify the hashing logic to use BLAKE3 with TBB only if the file size is smaller than the chunk size. Otherwise, use the standard BLAKE3.
- Add a configuration setting for chunk size, offering a list of predefined values for users to select from.
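The second bullet amounts to a simple dispatch on file size. A minimal sketch in Python (the function name `pick_update_path` is hypothetical; the returned strings merely label the two C entry points mentioned earlier in the thread):

```python
def pick_update_path(file_size, chunk_size):
    """Decide which BLAKE3 update path to use for a file.

    TBB only pays off when the whole file fits in a single chunk,
    i.e. one parallel update call covers the entire file. Otherwise
    the per-chunk setup/teardown overhead outweighs the parallel
    speedup, so the standard serial update is preferable.
    """
    if file_size < chunk_size:
        return "blake3_hasher_update_tbb"  # one parallel call, whole file
    return "blake3_hasher_update"          # sequential chunked hashing
```

With the proposed 1 MiB default, a 512 KiB file would take the TBB path in a single call, while a 6 GiB file would fall through to the standard chunked path.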
As for the plugin idea, I agree it would be a good addition. I’ll give it some thought and explore how it might be implemented.
I don't have enough knowledge to say whether it's technically correct, but based on your logic and testing, setting the default to 1 MiB sounds like a reasonable choice. Checking for the file's size being less than the chunk size should give some minor wins for mixed content on average, so that would be nice, as would the configurability you mentioned. These changes are very much appreciated. I'm glad you're considering the plugin idea, and I look forward to seeing which changes make it into the new HashCheck release.
@idrassi
I saw the interesting tests you ran some hours ago. As always, your hard work is really appreciated (like the amazing VeraCrypt, which of course I use). I have some opinions about the results you showed and about @redactedscribe's messages, which I agree with a lot.
I wish those tests could also be run on a 2.5" SATA SSD and on a mechanical (spinning) HDD, and optionally on a flash device (like a USB drive or microSD card), because with these new tests, maybe the implementation of the "Custom Chunk Size" setting could be omitted.
I agree with making 1 MiB the new default chunk size (for SATA/NVMe SSDs), but maybe a value of 256 KiB or even 512 KiB would be best for an HDD or flash device. It's known that an app can detect whether the hashing is being done on an SSD or an HDD.
In any case, the logic of using the TBB implementation only when the file size is smaller than the chunk size is perfect!
Thanks a lot for your great development 😃
@idrassi is there any news about releasing a new version with some of the changes and improvements you mentioned in your tests?
Thank you very much!