HdrHistogram.NET Lzct SSE instruction when available

We currently have at the heart of the hot path a manual method for find the leading zero count. This is used to identify the correct bucket to assign a recorded value. Frustratingly, this is supported as an intrinsic instruction on most modern CPU architectures.

The code is found in Bitwise.NumberOfLeadingZeros, and has been isolated with the intent that it can be a single place to refactor/optimize if the opportunity arises.

Follow this .NET core issue for progress/resolution

dotnet/corefx#2209
possibly moved from coreFx to CoreCLR https://github.com/dotnet/coreclr/issues/8089

Jan 15 '17 14:01 LeeCampbell

Couldn't that class be written like this:

internal static class Bitwise {

    private static readonly int[] Lookup;

    static Bitwise()
    {
        Lookup = new int[256];
        for (int i = 1; i < 256; ++i)
        {
            Lookup[i] = (int)(Math.Log(i) / Math.Log(2));
        }
    }
    public static int NumberOfLeadingZeros(ulong value)
    {
        if (value < uint.MaxValue)
            return 63 - Log2((uint)value);
        return 31 - Log2((uint)(value>>32));
    }

    private static int Log2(uint i)
    {
        if (i >= 0x1000000) { return Lookup[i >> 24] + 24; }
        if (i >= 0x10000) { return Lookup[i >> 16] + 16; }
        if (i >= 0x100) { return Lookup[i >> 8] + 8; }
        return Lookup[i];
    }
}

or inlined:

internal static class Bitwise
{
    private static readonly int[] Lookup;

    static Bitwise()
    {
        Lookup = new int[256];
        for (int i = 1; i < 256; ++i)
        {
            Lookup[i] = (int)(Math.Log(i) / Math.Log(2));
        }
    }

    public static int NumberOfLeadingZeros(ulong value)
    {
        if (value >= 0x100000000000000) { return 7 - Lookup[value >> 56]; }
        if (value >= 0x1000000000000) { return 15 - Lookup[value >> 48]; }
        if (value >= 0x10000000000) { return 23 - Lookup[value >> 40]; }
        if (value >= 0x100000000) { return 31 - Lookup[value >> 32]; }
        if (value >= 0x1000000) { return 39 - Lookup[value >> 24]; }
        if (value >= 0x10000) { return 47 - Lookup[value >> 16]; }
        if (value >= 0x100) { return 56 - Lookup[value >> 8]; }
        return 63 - Lookup[value];
    }
}

I don't have any profiling (or for that matter any stake in this project at all, I was searching for this particular function) but it seems odd that these functions are operating on signed values and that the impl for int is fundamentally different than long when the high half of a long is an int; surely one way is faster than the other.

Mar 10 '17 16:03 bbarry

Thanks for the interest and contribution @bbarry . Personally I am not a bit-fiddler magician, I am just leaning on the contributions from Gil and people like yourself. I will add your ideas to my benchmarks and see what bubbles up.

Jul 14 '17 07:07 LeeCampbell

Some quick feedback. This code base uses long not ulong. Negative values mean something to us.

Updating the proposed code to use long and work correctly (that hard coded 56 should be a 55), I added it to my benchmark suite.

For 64bit values:

 |                   Method |       Jit | Runtime |         Mean | Scaled |
 |------------------------- |---------- |-------- |-------------:|-------:|
 |    CurrentImplementation | LegacyJit |     Clr |     842.9 ns |   1.00 |
 |                   BBarry | LegacyJit |     Clr |     805.3 ns |   0.96 |
 |    CurrentImplementation |    RyuJit |     Clr |     712.3 ns |   1.00 |
 |                   BBarry |    RyuJit |     Clr |     853.6 ns |   1.20 |
 |    CurrentImplementation |    RyuJit |    Core |     750.9 ns |   1.00 |
 |                   BBarry |    RyuJit |    Core |   1,224.5 ns |   1.63 |

Note that on LegacyJit it sneaks ahead of the current implementation (but still comes 3rd to some other implementations). However on RyuJIT it is significantly slower.

For predominantly 32 bit values (which in this libraries case are the more performance sensitive values):

 |                   Method |       Jit | Runtime |        Mean | Scaled |
 |------------------------- |---------- |-------- |------------:|-------:|
 |    CurrentImplementation | LegacyJit |     Clr |    442.8 ns |   1.00 |
 |              BBarry_imp2 | LegacyJit |     Clr |    464.9 ns |   1.05 |
 |    CurrentImplementation |    RyuJit |     Clr |    359.5 ns |   1.00 |
 |              BBarry_imp2 |    RyuJit |     Clr |    497.3 ns |   1.38 |
 |    CurrentImplementation |    RyuJit |    Core |    348.4 ns |   1.00 |
 |              BBarry_imp2 |    RyuJit |    Core |    661.8 ns |   1.90 |

Again we see that it is 5%, 38% and 90% slower than the current implementation.

I hope you find this valuable.

Jul 14 '17 13:07 LeeCampbell

It appears that this functionality has been merged to dotnet/coreclr master on via https://github.com/dotnet/coreclr/pull/14456. I am not sure when this will be released or how it will be made available.

Nov 17 '17 02:11 LeeCampbell