
Choice of hash function for caching

remia opened this issue 3 years ago • 5 comments

Starting from v2, we use hashing more heavily in the library, notably to compute the various Processor cacheIDs. The current algorithm is based on MD5, which is probably not the best fit for our use case. As pointed out in a recent Slack discussion, we should use a non-cryptographic hash with reasonable speed and good quality (minimising collisions). Some candidates were mentioned: FarmHash, xxHash, and wyhash. Some comparisons are also available in this article (from 2016).

As an example, see this PR, which replaces MD5 with xxHash. This gives a good speedup, around 10-15x for a 65³ 3D LUT cacheID (6ms to 0.4ms). It is imported as a header-only library, similar to how we use SampleICC. If there is reluctance to add such an (internal) dependency, we could also choose something simpler by default in the library and provide a user override, as we already do for the filename hash function. Since C++17, std::hash has a specialisation for string_view which could fit our needs, but it wouldn't be available in all of our supported environments, so it probably gets disqualified.
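As a rough illustration of the kind of speed gap at stake (this is not the PR's benchmark; xxHash isn't in the Python standard library, so zlib.crc32 stands in as the fast non-cryptographic hash — note crc32 is only 32-bit and far too collision-prone for cacheIDs, it's here purely to show the cost difference on a LUT-sized buffer):

```python
import hashlib
import time
import zlib

# Stand-in for a 65x65x65 RGB float32 3D LUT: 65**3 entries * 3 channels * 4 bytes.
payload = b"\x00" * (65 ** 3 * 3 * 4)  # ~3.3 MB

def time_hash(fn, data, iters=10):
    """Average wall-clock time of hashing `data` once, in seconds."""
    start = time.perf_counter()
    for _ in range(iters):
        fn(data)
    return (time.perf_counter() - start) / iters

md5_ms = time_hash(lambda d: hashlib.md5(d).hexdigest(), payload) * 1000
crc_ms = time_hash(lambda d: zlib.crc32(d), payload) * 1000
print(f"MD5:   {md5_ms:.2f} ms per hash")
print(f"crc32: {crc_ms:.2f} ms per hash")  # typically several times faster than MD5
```

The exact ratio varies by machine, but the general point holds: a cryptographic digest pays for collision-resistance guarantees that a cache key doesn't need.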

As an example of where the current implementation can be relatively slow, consider an application generating a Processor representing a full colour management pipeline. Depending on the source, this could mean inverting the colour rendering, applying a number of looks, then applying the colour rendering again. Such a Processor can contain a number of 1D and 3D LUTs. While the Processor's outer cache uses a simplified hash that does not include the LUT entries, the fallback mechanism does. In my example case, where the per-shot context changes, this adds at minimum around 24ms to the Processor creation time just for the fallback. The same cost applies if the client code calls Processor::getCacheID().

Example ocioperf numbers on a heavy Processor:

Create the (display, view) processor:   For 50 iterations, it took: [2000.68, 27.5805, 67.0426] ms
// Switch to xxHash
Create the (display, view) processor:   For 50 iterations, it took: [2182.96, 4.08732, 47.6649] ms

Any thoughts?

remia avatar May 13 '22 14:05 remia

My only thoughts are, that's pretty damn cool, Remi. Presumably, transform hashes would be constant across architectures and platforms, right?

zachlewis avatar May 16 '22 15:05 zachlewis

My only thoughts are, that's pretty damn cool, Remi. Presumably, transform hashes would be constant across architectures and platforms, right?

At least from what I see in the first lines of the xxHash README, yes, but that's probably something to check for the other options as well. Do you already have a use case for that @zachlewis?

Code is highly portable, and hashes are identical across all platforms (little / big endian).

remia avatar May 16 '22 15:05 remia

At least from what I see in the first lines of the xxHash README, yes, but that's probably something to check for the other options as well. Do you already have a use case for that @zachlewis?

Nothing I've implemented in earnest yet, and I've been meaning to verify that the current hashing mechanism is stable across platforms / architectures; but I've been experimenting with using the cacheID of the most losslessly-optimized version of a given transform (or transforms) as the ProcessList "id" (i.e., CLF and CTF metadata). I'm not sure this is the wisest decision to begin with: even though they amount to the same thing computationally, I don't know whether the "id" should refer to the cumulative transform or to the unique composition thereof. Likewise, I think I'd rather have two IDs for my purposes -- one identifying the "purest", highest-resolution version of a given transform (parametric transforms, half-domain LUT1Ds, finishing-tier cubes), and one identifying the "current", comparatively lossy version of the transform as it exists in a given serialized form and container.

Basically, I was considering using hashes as a means for uniquely identifying transforms (i.e., via a lookup table -- which may as well be an OCIO config utilizing Named Transforms), independent of any other information (CLF/CTF metadata, OCIO color space naming, ACEStransformIDs, file naming, whatever). Could also be used for identifying identical transforms across (and within) separate configs, sort of like ad-hoc "equality groups" -- which I'd imagine would be useful for cross-config communication. I was considering opening a ticket suggesting such a capability (that cross-config Processors obey equality group equivalence as a means for interfacing with each other, as an alternative to specifying interchange roles; but I also like the idea of mandatory interchange roles.)
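A minimal sketch of that lookup-table / equality-group idea, with made-up hash values standing in for real transform cacheIDs (no OCIO calls here; in practice each hash would come from a Processor on the optimized transform):

```python
# Hypothetical cacheID -> color space names, one dict per config.
# The hash strings are invented for the example.
config_a = {
    "f3ab2c": ["ACEScg"],
    "9d01e7": ["Rec.709 Camera"],
}
config_b = {
    "f3ab2c": ["scene_linear_ap1"],
    "551bd0": ["sRGB Texture"],
}

def equality_groups(a, b):
    """Pair up color spaces whose underlying transforms hash identically
    across two configs -- an ad-hoc equality group."""
    return {h: (a[h], b[h]) for h in a.keys() & b.keys()}

print(equality_groups(config_a, config_b))
# {'f3ab2c': (['ACEScg'], ['scene_linear_ap1'])}
```

Matching hashes would identify interchangeable spaces without relying on naming conventions, though as noted below, the comparison is only meaningful if both configs share an anchor point.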

This would be particularly useful in pipelines such as ours, where we use FileTransforms to reference filenames that reflect a general task ("camera_decoding.ctf") rather than a specific transform ("Log3G10_RWG_to_ACES.ctf"). (In practice, the less specific file/names are actually symlinks to more specific file/names, but I'm trying to come up with a way of representing that same context-bound variance without relying on the filesystem -- i.e., for use with config archives)

Additionally, Autodesk's SynColor schema seems to have an Info/Functional_MD5 element, which is presumably used for similar purposes -- unique identification of a transform (or of an optimized form of a transform?). I guess Doug could probably speak to that.

ANYWAY -- I wouldn't personally be heartbroken if whatever hash scheme we choose varies across platforms and architectures, for the sake of speed -- although I do think it would be useful if the API provided an easy way to compute an architecture- and platform-invariant hash that can be used for purposes such as identification, even if it's not the same hash scheme used internally for cache IDs.

(sidebar -- you wouldn't believe just how much storage multiple versions of the same LUTs with different file names can take up at a finishing facility! I can recall one facility that at one point had decided to just name all their LUTs after their MD5 hashes; apparently, they pretty quickly decided to stop doing that and go back to human-readable filenames)
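That MD5-filename anecdote is essentially content-addressed storage. A stdlib sketch of getting the deduplication benefit while keeping human-readable names (all file names and contents here are invented for the example):

```python
import hashlib

# Hypothetical LUT file contents keyed by human-readable name.
lut_files = {
    "show_grade_v1.cube": b"LUT_3D_SIZE 2\n0 0 0\n1 1 1\n",
    "show_grade_final.cube": b"LUT_3D_SIZE 2\n0 0 0\n1 1 1\n",  # same bytes, new name
    "camera_decoding.cube": b"LUT_1D_SIZE 2\n0\n1\n",
}

# Store one blob per unique content hash; readable names become cheap aliases.
store = {}
aliases = {}
for name, data in lut_files.items():
    digest = hashlib.md5(data).hexdigest()
    store.setdefault(digest, data)
    aliases[name] = digest

print(len(lut_files), "names ->", len(store), "unique LUTs")
# 3 names -> 2 unique LUTs
```

Users only ever see the alias names; the hash is an internal key, which avoids the unreadable-filename problem the facility ran into.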

zachlewis avatar May 16 '22 17:05 zachlewis

Thanks for the answer @zachlewis.

and I've been meaning to verify that the current hashing mechanism is stable across platforms / architectures

It seems to be the case as it is checked in the unit tests in a few places: https://github.com/AcademySoftwareFoundation/OpenColorIO/blob/main/tests/cpu/Processor_tests.cpp#L31 https://github.com/AcademySoftwareFoundation/OpenColorIO/blob/main/tests/cpu/GpuShader_tests.cpp#L40 https://github.com/AcademySoftwareFoundation/OpenColorIO/blob/main/tests/cpu/CPUProcessor_tests.cpp#L820

I was considering opening a ticket suggesting such a capability (that cross-config Processors obey equality group equivalence as a means for interfacing with each other, as an alternative to specifying interchange roles; but I also like the idea of mandatory interchange roles.)

Interesting thought, although the interchange roles do have other use cases, as discussed elsewhere, to bridge with other colour management systems. I assume that for this ad-hoc equality group detection, you would still need a known anchor point in each config to make sure the comparison of transform hashes is meaningful.

Am I understanding correctly that the other things you mention mostly revolve around storage space savings?

In any case, I think we should maintain the existing behaviour, that is, an architecture / platform independent hash, which xxHash seems to already provide.

remia avatar May 20 '22 19:05 remia

In any case, I think we should maintain the existing behaviour, that is, an architecture / platform independent hash, which xxHash seems to already provide.

Superb -- thanks for all the investigating.

Interesting thought, although the interchange roles do have other use cases, as discussed elsewhere, to bridge with other colour management systems. I assume that for this ad-hoc equality group detection, you would still need a known anchor point in each config to make sure the comparison of transform hashes is meaningful.

It was a pretty half-baked thought :) I was thinking such transform hashes could be a quick way to check whether Config A and Config B share common ColorSpace definitions, which would imply the configs share a common reference space.

Am I understanding correctly that the other things you mention mostly revolve around storage space savings?

lol no, not exactly -- sorry for the confusion, I'm just rambling. The other things I'm mentioning mostly revolve around validating and tracking transforms in our production pipelines, which are typically very FileTransform-heavy, evolve over time, and usually consist of intentionally ambiguously named files ("camera_rgb_to_ap0", "camera_log_to_lin") -- for stuff like keeping sites in sync, succinctly representing the "state" of a show, and making sure the same transforms are available to OCIO v1 / SynColor (Flame). I'm concerned with objectively identifying and tracking, for lack of a better term, "pure transforms" (and perhaps their inverses) -- i.e., the mathematical functions a serialized file transform intends to represent. Kind of like empirically derived, container-independent "pure-transform-id" metadata.

For example, the same LogCameraTransform might be serialized as a parametric CTF node, as a half-domain CLF 2.0 array, as a 1024-entry spi1d array, or as a set of sampled GradingRGBSplineTransform control points and slopes -- or as any of the above with any number of NoOp transforms on either end. I would expect optimized and less-optimized versions of the same OCIO Processor to yield CTFs with different transforms and ProcessList "id" values (akin to XMP's "documentId" tag); and I'd also like to track a value uniquely identifying a common "source" or "pure" transform origin (akin to XMP's "originalDocumentId" tag).

Anyway, here are the functions I'm using for computing that "pure-transform" ID:

import PyOpenColorIO as ocio
from copy import deepcopy
from typing import Mapping, Optional, Union


def get_optimized_transform(
    transform: ocio.Transform,
    optimization: ocio.OptimizationFlags = ocio.OptimizationFlags.OPTIMIZATION_ALL,
    inverse: bool = False,
    config: ocio.Config = None,
    context: ocio.Context = None,
    name: str = None,
    ID: str = None,
    **context_vars: Mapping[str, str]
) -> ocio.Transform:
    config = config or ocio.GetCurrentConfig()
    # Copy the context so the caller's context vars are not mutated.
    context = deepcopy(context or config.getCurrentContext())
    direction = (
        ocio.GetInverseTransformDirection(transform.getDirection())
        if inverse else transform.getDirection()
    )
    for k, v in context_vars.items():
        context.setStringVar(k, v)

    optimized_transform = config.getProcessor(
        context=context, transform=transform, direction=direction
    ).getOptimizedProcessor(optimization).createGroupTransform()

    # Carry over any name / id metadata from the source transform.
    if hasattr(transform, 'getFormatMetadata'):
        fmd = transform.getFormatMetadata()
        name, ID = name or fmd.getName(), ID or fmd.getID()
    if name:
        optimized_transform.getFormatMetadata().setName(name)
    if ID:
        optimized_transform.getFormatMetadata().setID(ID)

    return optimized_transform


def get_transform_cache_id(
    transform: ocio.Transform,
    inverse: bool = False,
    config: ocio.Config = None,
    context: ocio.Context = None,
    optimization: Optional[Union[ocio.OptimizationFlags, int]] = ocio.OptimizationFlags.OPTIMIZATION_ALL,
    **context_vars: Mapping[str, str],
) -> str:
    transform = get_optimized_transform(
        transform, optimization=optimization, config=config, context=context,
        inverse=inverse, **context_vars)
    # Hash the optimized transform against an empty config, so the ID depends
    # only on the transform itself rather than on the active config.
    return ocio.Config().getProcessor(transform).getCacheID()

zachlewis avatar May 23 '22 14:05 zachlewis