Try new .Net Core 2.1 PopCount() intrinsic for HAMT implementation

Open dadhi opened this issue 6 years ago • 9 comments

Currently it uses the Hamming weight algorithm to calculate the number of set bits in an int value: https://bitbucket.org/dadhi/dryioc/src/e788ffce78b727b35fda5355a2297d40f0e7731b/Net45/Playground/HashArrayMappedTrieTests.cs#lines-190

The thing is, there is now System.Runtime.Intrinsics.X86.PopCount(uint/ulong).

The first step will be a BDN (BenchmarkDotNet) benchmark of SWAR or HammingWeight against PopCount.
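For reference, the two candidates could look roughly like the sketch below. It is not the ImTools code: the class and method names (BitCount, Swar, Hardware) are hypothetical, and it assumes a runtime where the intrinsic is exposed as System.Runtime.Intrinsics.X86.Popcnt.PopCount (netcoreapp3.0 or later). The SWAR variant is the textbook 32-bit Hamming weight, used as a fallback when the CPU does not support POPCNT.

```csharp
using System;
using System.Runtime.Intrinsics.X86;

public static class BitCount
{
    // Textbook SWAR (SIMD within a register) Hamming weight for a 32-bit value.
    public static uint Swar(uint n)
    {
        unchecked
        {
            n -= (n >> 1) & 0x55555555u;                      // 2-bit partial sums
            n = (n & 0x33333333u) + ((n >> 2) & 0x33333333u); // 4-bit partial sums
            n = (n + (n >> 4)) & 0x0F0F0F0Fu;                 // 8-bit partial sums
            return (n * 0x01010101u) >> 24;                   // add the four bytes
        }
    }

    // Uses the POPCNT instruction when the CPU and runtime support it,
    // otherwise falls back to the SWAR version above.
    public static uint Hardware(uint n) =>
        Popcnt.IsSupported ? Popcnt.PopCount(n) : Swar(n);

    public static void Main()
    {
        foreach (uint n in new uint[] { 0u, 1u, 0xFFu, 0x80000001u, uint.MaxValue })
            Console.WriteLine($"{n:X8}: swar={Swar(n)}, hardware={Hardware(n)}");
    }
}
```

Both methods should agree on every input, so a benchmark only has to compare their speed.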

dadhi • Jun 15 '18 16:06

Have you done that? May I do it?

dzmitry-lahoda • Jan 17 '19 07:01

No, not yet. Yes, you can try!

dadhi • Jan 17 '19 07:01

@dzmitry-lahoda I have added two basic HAMT implementations into the Playground project with some tests.

There are two (or more) implementations: IntHashTrie<V> and HashTrie<K, V> (for arbitrary keys).

The methods to consider replacing with PopCount are named uint GetSetBitsCount(uint n): https://github.com/dadhi/ImTools/blob/75b3b3b61db93653829ab9430ff9920745ea4b48/src/Playground/IntHashArrayMappedTrieTests.cs#L211
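To show why the bit count sits on the hot path of a HAMT, here is a minimal sketch of the usual bitmap-indexing step. It is a generic illustration, not the code from the linked file: SparseNode and ChildIndex are hypothetical names, and the body of GetSetBitsCount is the textbook SWAR routine standing in for the real one. A node keeps a 32-bit bitmap of occupied slots plus a compact child array, and the position of a child is the number of set bits below its slot.

```csharp
using System;

public static class SparseNode
{
    // Stand-in for the method named in the thread (textbook SWAR bit count).
    public static uint GetSetBitsCount(uint n)
    {
        unchecked
        {
            n -= (n >> 1) & 0x55555555u;
            n = (n & 0x33333333u) + ((n >> 2) & 0x33333333u);
            return (((n + (n >> 4)) & 0x0F0F0F0Fu) * 0x01010101u) >> 24;
        }
    }

    // Returns the index of the child for a 5-bit hash fragment (0..31)
    // in the compact child array, or -1 when the slot is empty.
    public static int ChildIndex(uint bitmap, int fragment)
    {
        uint bit = 1u << fragment;
        if ((bitmap & bit) == 0)
            return -1;
        // Count the occupied slots below this one.
        return (int)GetSetBitsCount(bitmap & (bit - 1));
    }

    public static void Main()
    {
        uint bitmap = 0b1001_0010; // slots 1, 4 and 7 are occupied
        foreach (var fragment in new[] { 1, 3, 4, 7 })
            Console.WriteLine($"fragment {fragment} -> index {ChildIndex(bitmap, fragment)}");
    }
}
```

Every lookup and insert performs this count once per trie level, which is why swapping in a hardware PopCount looks attractive on paper.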

The hand-written benchmarks (without BenchmarkDotNet) are in https://github.com/dadhi/ImTools/blob/master/src/Playground/TreeBenchmarks.cs

dadhi • Jan 17 '19 09:01

               Method | ItemCount |        Mean |     Error |       StdDev |
--------------------- |---------- |------------:|----------:|-------------:|
         SWAR_ImTools |      1000 |  2,362.5 ns |  47.07 ns |    73.284 ns |
               Popcnt |      1000 |    504.7 ns |  10.05 ns |     8.908 ns |
 SWAR_ImTools_Inlined |      1000 |  1,658.4 ns |  27.01 ns |    25.261 ns |
    SWAR_ForkOfCoreFx |      1000 |  2,378.0 ns |  45.01 ns |    42.106 ns |
         SWAR_ImTools |     10000 | 23,281.8 ns | 362.54 ns |   339.117 ns |
               Popcnt |     10000 |  4,912.9 ns |  73.60 ns |    68.845 ns |
 SWAR_ImTools_Inlined |     10000 | 16,221.2 ns | 265.45 ns |   248.305 ns |
    SWAR_ForkOfCoreFx |     10000 | 24,810.7 ns | 492.58 ns | 1,049.733 ns |

I did this in my fork because it has a .NET Core-only build (I have a hard time working with legacy projects :( )

https://github.com/dzmitry-lahoda/ImTools/blob/unnoficial/src/Playground/BitCountsBenchmarks.cs

Observations from the synthetic benchmark:

  1. It seems better to mark the method for inlining explicitly, to be 100% sure it is inlined (I guess it will help for exactly that method, so I will test https://gist.github.com/mrange/d6e7415113ebfa52ccb660f4ce534dd4#gistcomment-2026775)
  2. Your SWAR is slightly faster than the other SWAR :) on my 7th-gen Intel
  3. Popcnt is 3 to 5 times faster (about 3.3x against the inlined SWAR and about 4.7x against the non-inlined one).

I guess I need to convert TreeBenchmarks to BDN and add a copy of the Trie code. I will send the results.

I guess I can merge these classes into a separate branch and send a pull request. But it would be better for contributors if the whole solution were .NET Core.

dzmitry-lahoda • Jan 17 '19 14:01

Adding 10 items (ms):
Trie PopCnt- 0
Trie - 0

Getting one out of 10 items 1,000,000 times (ms):
Trie PopCnt - 21
Trie - 20
====================
Adding 100 items (ms):
Trie PopCnt- 2
Trie - 0

Getting one out of 100 items 1,000,000 times (ms):
Trie PopCnt - 39
Trie - 38
====================
Adding 1000 items (ms):
Trie PopCnt- 0
Trie - 0

Getting one out of 1000 items 1,000,000 times (ms):
Trie PopCnt - 40
Trie - 38
====================
Adding 10000 items (ms):
Trie PopCnt- 4
Trie - 3

Getting one out of 10000 items 1,000,000 times (ms):
Trie PopCnt - 49
Trie - 22
====================
Adding 100000 items (ms):
Trie PopCnt- 66
Trie - 80

Getting one out of 100000 items 1,000,000 times (ms):
Trie PopCnt - 64
Trie - 23
====================
Adding 1000000 items (ms):
Trie PopCnt- 1489
Trie - 1533

Getting one out of 1000000 items 1,000,000 times (ms):
Trie PopCnt - 18
Trie - 24

Strange results.

  1. Maybe convert this to BDN to rule out outside influences (though such issues do not seem relevant here). I have an Intel 7700HQ and the process is 64-bit.
  2. Tried inlining GetSetBitsCount. No change.

So there seems to be no need for popcnt for now in this specific case.

dzmitry-lahoda • Jan 17 '19 15:01

I see that Trie is always faster than Tree, and slower only when adding 1M items. An interesting replacement for IoC. But it may be better to try BDN to measure memory and reduce fluctuations. E.g. right now GC.Collect() seems (not sure) to start a GC but not wait for it to finish.
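Until the benchmarks are moved to BDN, a hand-written harness can at least warm up the code and settle the GC before timing. The sketch below is an assumption about how that might look, not the code in TreeBenchmarks.cs: ManualTiming, SettleGc and Measure are hypothetical names. It uses the common collect, wait for finalizers, collect again sequence, so the measurement does not depend on how a single GC.Collect() call behaves.

```csharp
using System;
using System.Diagnostics;

public static class ManualTiming
{
    // Forces a full collection, waits for pending finalizers, then collects
    // again so objects released by finalizers are reclaimed before timing.
    public static void SettleGc()
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
    }

    // Runs the action once as a warm-up (so JIT time is excluded), settles
    // the GC, then times `iterations` further calls.
    public static TimeSpan Measure(Action action, int iterations)
    {
        action();
        SettleGc();

        var stopwatch = Stopwatch.StartNew();
        for (var i = 0; i < iterations; i++)
            action();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    public static void Main()
    {
        long sum = 0;
        var elapsed = Measure(() => { for (var i = 0; i < 1000; i++) sum += i; }, 1000);
        Console.WriteLine($"{elapsed.TotalMilliseconds} ms (sum = {sum})");
    }
}
```

This still reports a single timing without error bars or allocation data, which is what BDN would add.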

dzmitry-lahoda • Jan 17 '19 15:01

popcnt comes with the netcoreapp3.0 preview, not with 2.1 or 2.2 :) So I suggest closing this story, as it seems irrelevant both for performance and for usage in the near future.

dzmitry-lahoda • Jan 17 '19 15:01

Thanks for the test.

  • BDN is needed - I fully agree
  • For some reason, I was thinking that PopCount is available in .Net Core 2.1. Hmmm...

I will keep it open until I have played with it myself, and will decide how to proceed later.

dadhi • Jan 17 '19 15:01

For some reason I was trying to get the code of the intrinsic. I found that for .NET Core 2.1 there should be some kind of System.Runtime.Intrinsics.Experimental.dll, but I could not find it.

dzmitry-lahoda • Jan 22 '19 06:01