OpenColorIO icon indicating copy to clipboard operation
OpenColorIO copied to clipboard

Update hashing algorithm for Processor caches

Open remia opened this issue 3 years ago • 9 comments

Following in #1645 and discussions at the last TSC meeting, I'm opening this PR to get the ball rolling.

As mentioned, the choice of hash function is not fixed, it currently is xxHash (64 bits version, could be extended to 128 bits if needed) but could be updated to something else. One legitimate concern could be the added complexity cost of bringing such a dependency to the build. Any suggestions welcome, std::hash could be a candidate but might not be supported on all our targets (thinking about the std::string_view specialisation).

Another point up for discussion is whether or not we should provide user override, like we already have for the filename hashing function.

Looking for your feedback (also tagging @lgritz and @Shootfast).

remia avatar Jun 01 '22 20:06 remia

This would be a great improvement, thanks for exploring it Remi!

What is the situation regarding namespacing of the xxHash declarations? I notice that xxHash has some related compile flags and there is a namespace-related draft PR on their GitHub, but I didn't have time to study it. Did you investigate that aspect?

doug-walker avatar Jun 05 '22 18:06 doug-walker

If the main use case is to drop the unrequired "cryptographic" features that md5 was needlessly providing us with, have we tried just replacing the hash with something ridiculously simple like the Paul Larson hash function?

unsigned int hash(const char* s, unsigned int seed = 0){
    unsigned int hash = seed;
    while (*s) {
        hash = hash * 101  +  *s++;
    }
    return hash;
}

I'm not sure how prevelant hash collisions are within our regular use cases, but there are plenty of other very simple hash functions that wouldn't require any new libraries - header only or otherwise.

Shootfast avatar Jun 13 '22 19:06 Shootfast

Thanks @doug-walker, I was not aware of that issue and need to have a look later. Also need to check there is no symbol leakage if we were to use that library.

@Shootfast, thanks for the pointer, I'll try to test that as well. I think the main concern here is speed, but we definitively need to have some sense of what kind of hash quality (in terms of lowering collision risks) we are after. If you have any other hash function to suggest or benchmark we should use, feel free to add them here.

remia avatar Jun 20 '22 17:06 remia

Thanks for the Larson hash suggestion Mark, that does indeed look like a simple but good 32-bit hash function. This article says it is better for short keys than long ones, which would be problematic since we only use the md5 for long keys (we tend to use std::hash for short keys). But it's not clear how they draw that conclusion: https://www.strchr.com/hash_functions

However, we should keep in mind that OCIO does not really use the output for hashing. A better term might be fingerprinting. In other words, if two entries generate the same value, OCIO assumes they are identical -- there is no check for collisions (as you would in a hash table algorithm). So if two items generate the same hash value, that would be a serious OCIO bug. Hence, I'm nervous about reducing the size of the hash to 32-bits. The current md5 algorithm is 128 bits and Remi's proposed replacement is 64-bits.

Here is a table that shows the probability of collisions for various hash lengths: https://en.wikipedia.org/wiki/Birthday_attack

But the table assumes that the hashing function is a perfect random number generator and in practice they are not so perfect, so the actual probabilities are going to be higher (possibly much higher). I don't think it's unreasonable to say that someone could have 100 LUT files in a config or generate 100 processors. So the 32-bit row does not really seem to inspire confidence that a collision would not happen (across every config ever generated). I would vote that we stick with a minimum of a 64-bit hash / fingerprint.

I think Mark's larger point is still valid that there might be a 64-bit hash that would not involve bringing in a new library, but the problem is I'm not sure any of us have time to investigate and test the options. And if we make a bad choice, it could result in serious and perhaps difficult to track down bugs.

According to this table, XXH128 (128 bits) is almost as fast as XXH3 (64 bits). So if we decide to use xxHash, it seems like there would be minimal cost to using the safer choice (i.e., to stay with our current length of 128 bits): https://cyan4973.github.io/xxHash/

doug-walker avatar Jul 02 '22 20:07 doug-walker

Thanks @doug-walker that's great information!

I'm also leaning toward using xxHash 128 bits for the same reasons: I'm not an expert in hashing algorithms and this is the project that seems to offer the best performances and the most well supported (and widely used?) at the moment. Maybe I can try to setup some benchmark relevant to OCIO use cases, for example testing how many random LUTs it takes to reach N number of collisions for various hash algorithms (if this is even possible without heap-overflowing), while watching for some performance metrics simultaneously. Would this be useful to make progress on this PR? Any other ideas of things we should test?

From previous discussions about using hash for purposes other than internal caching, like config or processor fingerprinting as mentioned by @zachlewis and @KevinJW, it seems we should all agree on the use cases to make sure we design accordingly. The proposed idea of a callback to override the hash function would probably need to go away for example if we want to share the hash externally, same as making sure the hash generated do not vary per system architecture (granted this is already checked in the unit tests).

remia avatar Jul 04 '22 17:07 remia

Remi, if you use XXH128, I don't think there's a need to generate lots of random LUTs to test it. But if people don't want to use a proven library such as xxHash, then we may need to do some testing. I think the main question at this point is for the TSC to decide if we want to go ahead and add xxHash as a dependency. There is also the issue of hiding the symbols, but I'm assuming that will be possible.

I agree with you it's important to agree on the use-cases. The md5 is already used for Config and Processor fingerprinting and I don't think xxHash would change the current behavior, but it would be great to try and document the expectations. Towards that goal, I've started a Wiki page to document caching and cache IDs in OCIO. I invite people to improve it, and once we're happy with it I'll move it into the official documentation: https://github.com/AcademySoftwareFoundation/OpenColorIO/wiki/Caches-and-cache-IDs-in-OpenColorIO

Remi, as described in the issue you linked to, I know you started this project to speed the performance of a specific use-case. I'm curious how close the xxHash approach is to setting the OCIO_DISABLE_CACHE_FALLBACK env var with the md5 hash. Did you test that? (I still think we should replace the hash regardless, so I only ask out of curiosity.)

doug-walker avatar Jul 04 '22 18:07 doug-walker

I updated the PR to now use XXH3_128bits, interestingly in some cases the new low64 part was equal to the old XXH3_64bits but in other cases not, this can be seen in the commit diff, I'm not sure yet if it means anything.

To reduce the symbols visible in the library, I used XXH_INLINE_ALL which should also give a performance boost to hashing small data, even though it's not our main use case. The symbols I find in the generated (dynamic) library on macOS look like:

nm myinstall/lib/libOpenColorIO.2.2.0.dylib | grep -i XXH | c++filt
0000000000396200 s XXH3_kSecret
00000000001261a0 t XXH3_len_129to240_128b(unsigned char const*, unsigned long, unsigned char const*, unsigned long, unsigned long long)
0000000000125bf0 t XXH3_hashLong_128b_default(void const*, unsigned long, unsigned long long, void const*, unsigned long)

Which I believe should be harmless as "t" means these are not exported? I need to do more research to be extra sure. Without the inlining there is a lot more symbols but again only as "t" and we have the possibility of prefixing them with a custom prefix with XXH_NAMESPACE (it doesn't seem to affect the 3 symbols listed above but all others are correctly prefixed).

The wiki page looks great @doug-walker, but I'm not exactly sure how to make comment on that in Github, I also need to read it more carefully again as it's dense. One thing that already come to mind and I was actually meaning to propose is to deprecate SetComputeHashFunction() and rename it with something more explicit like SetFilePathHashFunction().

remia avatar Jul 05 '22 20:07 remia

Remi, as described in the issue you linked to, I know you started this project to speed the performance of a specific use-case. I'm curious how close the xxHash approach is to setting the OCIO_DISABLE_CACHE_FALLBACK env var with the md5 hash. Did you test that? (I still think we should replace the hash regardless, so I only ask out of curiosity.)

I think I didn't tested that because that would disable the caching of optimised and derived processors if memory serve so the time is lost elsewhere. But I'll try to come up with some numbers when I get time and update here.

remia avatar Jul 07 '22 19:07 remia

I did some more benchmark following @doug-walker suggestion to use OCIO_DISABLE_CACHE_FALLBACK. I had to come up with two different scenarios to show the drawback of disabling the cache because my first use case was always creating different processors so the fallback mechanism was not able to help. For context both case involve creating a display view transform from a display-referred space, for this OCIO v1 ACES-like config I'm using, it means inverting the ODT+RRT and a LMT, then applying them back targeting a different display (different ODT). There are 4 65-cubed 3DLUT in this processor. The two scenarios are:

  1. Each shot use a different CDL inside the DisplayViewTransform. Here the cache fallback is useless because the instantiated processors are different so the cache fallback will never find a match. I updated ocioperf to set the different shots per loop iteration and accumulated the created processors in vectors so that later metrics for derived processors always use the processor corresponding to their current iteration.
  2. Each shot use the same CDL (identity) duplicated on corresponding folder (so that the outer processor cache key doesn't detect the fact that they are the same). In this case the fallback helps speed up the creation of optimised / CPU / GPU processors. Uses the ocioperf adjustments as above.
Case 1. - 25 different shots / CDLs

main branch 11072022

Create the config identifier:           For 50 iterations, it took: [16.0242, 0.00297764, 0.323401] ms
Create the context identifier:          For 50 iterations, it took: [0.010848, 0.00216516, 0.00233882] ms
Create the (display, view) processor:   For 50 iterations, it took: [1154.41, 25.5535, 48.1307] ms
Create the optimized processor:         For 50 iterations, it took: [13.6941, 5.75264, 5.91147] ms
Create the GPU processor:               For 50 iterations, it took: [26.3315, 13.6078, 13.8622] ms
Create the GPU shader:                  For 50 iterations, it took: [20.7317, 8.69257, 8.93335] ms
Create the CPU processor:               For 50 iterations, it took: [53.6597, 23.654, 24.2541] ms

replace-md5 branch 11072022

Create the config identifier:           For 50 iterations, it took: [10.645, 0.00152405, 0.214393] ms
Create the context identifier:          For 50 iterations, it took: [0.007985, 0.00152176, 0.00165102] ms
Create the (display, view) processor:   For 50 iterations, it took: [1189.82, 4.57797, 28.2828] ms
Create the optimized processor:         For 50 iterations, it took: [15.4956, 6.69194, 6.86801] ms
Create the GPU processor:               For 50 iterations, it took: [2.26372, 0.713385, 0.744392] ms
Create the GPU shader:                  For 50 iterations, it took: [18.391, 8.48604, 8.68414] ms
Create the CPU processor:               For 50 iterations, it took: [23.8181, 10.661, 10.9241] ms

main branch 11072022 (OCIO_DISABLE_CACHE_FALLBACK)

Create the config identifier:           For 50 iterations, it took: [12.6743, 0.0018766, 0.255325] ms
Create the context identifier:          For 50 iterations, it took: [0.007492, 0.00159078, 0.0017088] ms
Create the (display, view) processor:   For 50 iterations, it took: [1096.16, 2.80633, 24.6734] ms
Create the optimized processor:         For 50 iterations, it took: [13.3289, 5.97358, 6.12069] ms
Create the GPU processor:               For 50 iterations, it took: [25.5712, 12.172, 12.44] ms
Create the GPU shader:                  For 50 iterations, it took: [15.6555, 9.43399, 9.55842] ms
Create the CPU processor:               For 50 iterations, it took: [54.54, 26.4725, 27.0338] ms

replace-md5 branch 11072022 (OCIO_DISABLE_CACHE_FALLBACK)

Create the config identifier:           For 50 iterations, it took: [12.9352, 0.00173798, 0.260408] ms
Create the context identifier:          For 50 iterations, it took: [0.007434, 0.00156096, 0.00167842] ms
Create the (display, view) processor:   For 50 iterations, it took: [1228.41, 3.18332, 27.6878] ms
Create the optimized processor:         For 50 iterations, it took: [15.4187, 7.34852, 7.50992] ms
Create the GPU processor:               For 50 iterations, it took: [2.07713, 0.733996, 0.760859] ms
Create the GPU shader:                  For 50 iterations, it took: [18.3291, 8.69082, 8.88359] ms
Create the CPU processor:               For 50 iterations, it took: [23.83, 12.4507, 12.6783] ms


Case 2. - 25 different shots using identity CDLs

main branch 11072022

Create the config identifier:           For 50 iterations, it took: [11.456, 0.00167531, 0.230761] ms
Create the context identifier:          For 50 iterations, it took: [0.003817, 0.00152845, 0.00157422] ms
Create the (display, view) processor:   For 50 iterations, it took: [1136.91, 24.715, 46.959] ms
Create the optimized processor:         For 50 iterations, it took: [13.3274, 0.00111346, 0.267639] ms
Create the GPU processor:               For 50 iterations, it took: [25.8516, 0.000753208, 0.517771] ms
Create the GPU shader:                  For 50 iterations, it took: [18.0081, 8.10345, 8.30154] ms
Create the CPU processor:               For 50 iterations, it took: [45.137, 0.00138575, 0.904099] ms

replace-md5 branch 11072022

Create the config identifier:           For 50 iterations, it took: [14.1439, 0.00165418, 0.284498] ms
Create the context identifier:          For 50 iterations, it took: [0.005422, 0.00146335, 0.00154252] ms
Create the (display, view) processor:   For 50 iterations, it took: [1175.33, 3.98283, 27.4098] ms
Create the optimized processor:         For 50 iterations, it took: [15.2188, 0.00139678, 0.305745] ms
Create the GPU processor:               For 50 iterations, it took: [1.85187, 0.000756634, 0.0377789] ms
Create the GPU shader:                  For 50 iterations, it took: [19.7146, 9.87258, 10.0694] ms
Create the CPU processor:               For 50 iterations, it took: [32.8429, 0.00404506, 0.660823] ms

main branch 11072022 (OCIO_DISABLE_CACHE_FALLBACK)

Create the config identifier:           For 50 iterations, it took: [18.9911, 0.00184297, 0.381629] ms
Create the context identifier:          For 50 iterations, it took: [0.005279, 0.00214949, 0.00221208] ms
Create the (display, view) processor:   For 50 iterations, it took: [1119.13, 3.53349, 25.8455] ms
Create the optimized processor:         For 50 iterations, it took: [15.5118, 6.95756, 7.12865] ms
Create the GPU processor:               For 50 iterations, it took: [29.7942, 12.6454, 12.9884] ms
Create the GPU shader:                  For 50 iterations, it took: [15.352, 6.89911, 7.06817] ms
Create the CPU processor:               For 50 iterations, it took: [43.7347, 21.4723, 21.9176] ms

replace-md5 branch 11072022 (OCIO_DISABLE_CACHE_FALLBACK)

Create the config identifier:           For 50 iterations, it took: [11.5274, 0.00158257, 0.232099] ms
Create the context identifier:          For 50 iterations, it took: [0.003214, 0.00141204, 0.00144808] ms
Create the (display, view) processor:   For 50 iterations, it took: [1096.68, 2.52752, 24.4105] ms
Create the optimized processor:         For 50 iterations, it took: [30.5916, 8.87726, 9.31155] ms
Create the GPU processor:               For 50 iterations, it took: [2.25362, 0.765239, 0.795007] ms
Create the GPU shader:                  For 50 iterations, it took: [17.7799, 8.9354, 9.11229] ms
Create the CPU processor:               For 50 iterations, it took: [28.4583, 12.0914, 12.4187] ms

Numbers comes from a CentOS 7 workstation, sorry for the poor formatting, I may edit later to create a table for it if need be. I'm not sure how often the case where the cache fallback hits happen in production but I guess there was a demand for it so it's probably best to not use that environment variable if possible.

remia avatar Jul 11 '22 15:07 remia

This looks good to me, but one question is whether we want to embed the header in the repo, or if we should implement our CMake pattern for optionally pulling the dependency from the upstream repo, or utilizing an install on the user's machine. One benefit of removing md5 is removing an embedded dependency, but here we're adding another. This could be addressed later.

michdolan avatar Oct 20 '22 12:10 michdolan

Hi @Shootfast - we discussed this in the most recent TSC. We'd like to merge this in for the upcoming 2.2 release as it stands, which will mean the new dependency add. Wanted to give you a heads up - if we don't hear from you, we're going to merge this next week.

carolalynn avatar Oct 20 '22 16:10 carolalynn

@michdolan thanks for raising that, I was also thinking having it embedded could be beneficial to more tightly control which version is used and not add a dependency for package maintainers. But I agree this should be a discussion to have later.

I did a check again on possible exported symbols and reached the same conclusions as my previous comment. It seems we only have non-exported local symbol corresponding to xxHash in the generated library. These don't show up at all if you use nm -g, so I believe we are good here but tagging @lgritz @Shootfast in case this methodology is flawed.

nm myinstall/lib64/libOpenColorIO.so | grep -Ii xxh
000000000023b9c0 t XXH3_hashLong_128b_default.constprop.0
000000000023b5e0 t XXH3_len_129to240_128b.isra.0.constprop.0
00000000004e7ec0 r _ZL12XXH3_kSecret

remia avatar Oct 21 '22 11:10 remia

Should I work on migrating the xxHash dependency to external instead now then? I'd guess if we want to do that later it would need to be in a 2.3 release.

remia avatar Oct 26 '22 09:10 remia

I had time to review it. It seems all good. It seems to have conflicts that must be resolved?

cedrik-fuoco-adsk avatar Oct 27 '22 14:10 cedrik-fuoco-adsk

I just rebased from main to fix the conflict.

remia avatar Oct 27 '22 14:10 remia