Primes
Improve C# perf from 300 to 1305 passes per second
Description
Contributing requirements
- [x] I read the contribution guidelines in CONTRIBUTING.md.
- [x] I placed my solution in the correct solution folder.
- [x] I added a README.md with the right badge(s).
- [x] I added a Dockerfile that builds and runs my solution.
- [x] I selected `drag-race` as the target branch.
- [x] All code herein is licensed compatible with BSD-3.
You committed a bunch of files from the `obj/` directory, which shouldn't be part of the repo. A `.gitignore` failure?
Edit: Hmm. Looks like @rbergen created a PR against this PR to fix that problem.
I like that you continued on the path of working with the `halfFactor`. I do have some things you might want to address though:
Adding `readonly` to `halfLimit` should help the compiler a little bit.
I haven't used `ArrayPool` before, but returning the array to the pool at the end of the calculation doesn't feel right. If you are calculating the primes, it's kind of assumed that we want to read them back for some other calculation, or in this case we want them around for validating later. I guess it's C#'s "use after free".
~~Style preference: `<< 1` and `>> 1`. Since the intention is to divide by 2 and multiply by 2, using `* 2` and `/ 2` instead makes it easier to read, and it should compile to the same assembly anyway.~~
I also noticed this "bug" in the comments: `// Remember that halfNum is already "pre-divided" by 2, so we divide it by 32 rather than 64 when locating the 64-bit word in which the bit will land`

This didn't look right to me, and to verify my expectations I did `64 >> 6` in dotnet fiddle; as I expected, it returned 1. ~~Changing `bits[halfNum >> shiftAmount]` -> `bits[halfNum / 64]` should make it clearer and not impact performance.~~ It actually impacts performance by 20%; is this a compiler bug?
Update: turns out the compiler can't fully optimize the division to a bit shift, since it can't prove that we don't get negative values. Making the values `uint` would resolve the issue.
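A standalone sketch of the signedness issue (method names are mine, not from the PR): for signed values, division rounds toward zero while an arithmetic right shift rounds toward negative infinity, so the two disagree on negative inputs and the JIT can't substitute one for the other. For `uint` they agree for every value, so the division compiles down to a single shift.

```csharp
using System;

class ShiftVsDivide
{
    // Signed: division rounds toward zero...
    static int DivSigned(int n) => n / 64;

    // ...but an arithmetic right shift rounds toward negative infinity,
    // so the JIT can't replace the division with a bare shift.
    static int ShiftSigned(int n) => n >> 6;

    // Unsigned: the two operations agree for every value, so the JIT
    // is free to compile this division down to a single shift.
    static uint DivUnsigned(uint n) => n / 64;

    static void Main()
    {
        if (DivSigned(-1) != 0) throw new Exception("div");      // -1 / 64 == 0
        if (ShiftSigned(-1) != -1) throw new Exception("shift"); // -1 >> 6 == -1
        if (DivUnsigned(128) != 2) throw new Exception("udiv");
        Console.WriteLine("signed / and >> disagree on negatives; uint is safe");
    }
}
```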
@davepl Thanks for merging my suggestions to remove the collateral files from this PR. Would you like me to merge this as this stands, or would you still like to consider the comment that @ItalyToast added?
I was also looking at improving the performance of the C# code, and made pretty much the same changes as you did. After comparing the two, I noticed that mine was quicker. After loads of testing of the differences, I found that using an array of bytes with your solution is quicker than using an array of `UInt64` (on my PC at least). The `UInt64` version ran at about 7900 iterations, whereas changing the code to use bytes ran at about 9200 iterations.
Another suggestion would be to remove the `AsSpan()` in `halfbits.AsSpan()`, as that also improves performance.

Not sure if this would be considered cheating, as it uses unsafe code, but if you make `bits` a raw pointer:

```csharp
fixed (byte* bits = &halfbits[0])
```

the code will run even faster, as it removes bounds checking when setting values in the array. I got about 9700 iterations doing this.
This looks pretty slick.
A couple of observations / questions:
- Using `ArrayPool` is potentially opening a can of worms. It's absolutely what one should do with C# to help alleviate GC costs. The question is whether we want every other solution recycling buffers instead of allocating -- I think some of these have been rejected in the past.
- Big caveat with `ArrayPool` -- the array you get when you `Rent` it is not zeroed by default. It will be on the first iteration, because it's been freshly allocated in the normal way. But when you get a recycled one, it will have all the data that was in there previously. It has to be zeroed (or otherwise filled) before use.
- On the general fairness front, we should consider allowing JIT'd languages (C#, Java, Scala, etc.) to warm up the code for the sieve before taking the time measurement. Just do a short dry-run, then do the timed measurement afterwards. We're maybe a little biased to measuring JIT time.
After the Java vs C# video, I was working on an unsafe pointer-based version, which I guess would give similar performance to this. Unfortunately, my laptop crashed.
One thing I did want to add is that I saw a noticeable speed increase by going from .NET Core 3.1 to .NET 6. I know it's still in preview, and my solution was different from this one. But maybe someone can try and see if it makes a difference for this implementation as well. Note that the performance difference was not huge, in the order of 1-2%.
Also, I totally agree with the statement above about allowing a warmup run for JIT languages.
> A couple of observations / questions:
>
> - Using `ArrayPool` is potentially opening a can of worms. It's absolutely what one should do with C# to help alleviate GC costs. The question is whether we want every other solution recycling buffers instead of allocating -- I think some of these have been rejected in the past.
@mike-barber We reject a solution if its algorithm knowingly recycles a once-allocated buffer. We do not reject a solution for using a platform/language-provided allocation mechanism, which `ArrayPool` is in the context of .NET. The fact of the matter is that the "low-level" standard C library generally also reuses buffers from the heap.
> - Big caveat with `ArrayPool` -- the array you get when you `Rent` it is not zeroed by default. It will be on the first iteration, because it's been freshly allocated in the normal way. But when you get a recycled one, it will have all the data that was in there previously. It has to be zeroed (or otherwise filled) before use.
That's a very good point; I missed this when reviewing. @davepl I think a change to this PR is required that indeed clears the array after renting.
> - On the general fairness front, we should consider allowing JIT'd languages (C#, Java, Scala, etc.) to warm up the code for the sieve before taking the time measurement. Just do a short dry-run, then do the timed measurement afterwards. We're maybe a little biased to measuring JIT time.
We allow an implementation to warm up for a maximum of 5 seconds before executing the actual 5-second benchmark run, if it is thought (and preferably demonstrated) to mitigate an unfair performance disadvantage.
Thanks @rbergen -- it's good to have that clarification. `ArrayPool` is a pretty important part of what I use in C# every day :)
There are a couple of things to fix if we proceed using `ArrayPool`:

- the thing I mentioned -- clearing it before use, and only clearing as much as we need -- not the whole larger-than-needed array it gives us.
- we can ask for a smaller array -- we're asking for 8x too large currently; `sizeof(UInt64)` is the number of bytes in the word, not bits.
- it's kind of incorrect to `Return` the array while we're still using it, as @ItalyToast pointed out, at the end of the `runSieve` method. We're still using the array later to count the primes. This won't manifest as a bug in the single-threaded case, because nothing else can ask to `Rent` it in the interim. But it'll definitely be a problem if anyone tries to multi-thread this. The way I deal with pooled arrays in my code is to implement `IDisposable` on the class holding the array, and then use the usual `using` pattern.
Hope that helps!
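To make those points concrete, here's a hedged sketch (the class and names are my own invention, not code from the PR) of renting the right number of 64-bit words rather than bytes, clearing only the portion actually in use, and deferring the `Return` to `Dispose` so the usual `using` pattern governs the array's lifetime:

```csharp
using System;
using System.Buffers;

// Hypothetical wrapper illustrating the pattern: rent, clear only what we
// use, and Return the array from Dispose rather than mid-calculation.
sealed class PooledBits : IDisposable
{
    public readonly ulong[] Words;
    public readonly int WordCount;

    public PooledBits(int bitCount)
    {
        // sizeof(ulong) is 8 *bytes*; a ulong holds 64 *bits*,
        // so we need bitCount / 64 words, rounded up.
        WordCount = (bitCount + 63) / 64;
        Words = ArrayPool<ulong>.Shared.Rent(WordCount);

        // Rent can hand back a recycled, dirty, larger-than-requested
        // array: clear only the words we will actually read.
        Words.AsSpan(0, WordCount).Clear();
    }

    public void Dispose() => ArrayPool<ulong>.Shared.Return(Words);
}

class Demo
{
    static void Main()
    {
        using (var bits = new PooledBits(1_000_000))
        {
            if (bits.WordCount != 15625) throw new Exception("size");
            foreach (var w in new ReadOnlySpan<ulong>(bits.Words, 0, bits.WordCount))
                if (w != 0) throw new Exception("not cleared");
        }
        Console.WriteLine("ok");
    }
}
```

Note that the rented array may be longer than `WordCount`, so any loop over the bits should iterate over `Words.AsSpan(0, WordCount)` rather than `Words.Length`.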
Are we certain the ArrayPool is not cleared by the system? I mean, what if I'm in a different security context? Does it really return the last pool data? If it's doc'd that way I'll certainly add the step, as long as it's not redundant.
The size is an outright bug if you're correct, and I'll fix it. Same with inspecting the results post-return.
I'm not wise in the ways of Git yet. Is there an open issue that can be assigned to me, or something like that?
@davepl `ArrayPool` clears the buffer on return if you tell it to via the `clearArray` parameter to `Return`. It defaults to `false` and you don't set it, so according to the documentation the buffer isn't cleared.
Concerning "flagging" this as an issue: from the perspective of this repo it is not an issue, because your PR isn't merged yet. It is part of your fork, so you or someone else could indeed open an issue for it, there. That could then be assigned to yourself.
Hi @davepl - apologies for the delay (time zones - I'm in Ireland)
> Are we certain the ArrayPool is not cleared by the system? I mean what if I'm in a different security context? Does it really return the last pool data? If it's doc'd that way I'll certainly add the step as long as its not redundant.
ArrayPool keeps a pool of arrays, and yanks one out the pool when you ask for one (or creates a new one if the pool is empty). In order to maintain pools of these, the requested size is rounded up to the nearest power of 2. So if you ask for 23k or 26k or something, you'll get a 32k array -- this allows ArrayPool.Shared to group these into pools (32k, 64k, etc), which wouldn't work if it was just arbitrary sizes.
When you ask to `Rent` an array, it'll either return an existing one or create a new one. It's literally the exact same array someone `Return`ed earlier, unmodified. If you want to observe this directly, set a breakpoint and peek at what you get back from `Rent`.
This might sound like a terrible idea, but it's like this for performance reasons, as pooling is really only used in performance contexts. Most of the time you don't actually want the overhead of a zeroed array. For example, if you're reading a bunch of bytes from a file or a socket, you're overwriting the array anyway. If you do want it zeroed, remember that the array you get is larger than you asked for: you'd only want to initialize the part that you're using, with something like `myarray.AsSpan(0, mylen).Clear()`.
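A small runnable illustration of both behaviours (the 32k figure is typical of the current `ArrayPool<byte>.Shared` bucketing, but only "at least the requested length" is actually guaranteed):

```csharp
using System;
using System.Buffers;

class RentSizes
{
    static void Main()
    {
        // The shared pool buckets arrays by powers of two, so asking for
        // 23_000 bytes typically yields a 32_768-byte array. The exact
        // length is an implementation detail; the only guarantee is that
        // it is at least what you asked for.
        int myLen = 23_000;
        byte[] buf = ArrayPool<byte>.Shared.Rent(myLen);
        if (buf.Length < myLen) throw new Exception("too small");

        // Zero only the slice we're going to use, not the whole buffer.
        buf.AsSpan(0, myLen).Clear();

        ArrayPool<byte>.Shared.Return(buf);
        Console.WriteLine("rented length: " + buf.Length);
    }
}
```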
The docs are better than they used to be, and it is mentioned here: https://docs.microsoft.com/en-us/dotnet/api/system.buffers.arraypool-1.rent?view=net-5.0#remarks
> This buffer is loaned to the caller and should be returned to the same pool using the Return method, so that it can be reused in subsequent calls to the Rent method. ... The array returned by this method may not be zero-initialized.
https://docs.microsoft.com/en-us/dotnet/api/system.buffers.arraypool-1.shared?view=net-5.0#remarks
> The shared pool provides a default implementation of the ArrayPool&lt;T&gt; class that's intended for general applicability. A shared class maintains arrays of multiple sizes, and may hand back a larger array than was actually requested, but it will never hand back a smaller array than was requested. Renting a buffer from a shared class using the Rent method will result in an existing buffer being taken from the pool if an appropriate buffer is available or in a new buffer being allocated if one is not available.
If you're curious, the source code for `ArrayPool` and `ArrayPool.Shared` is here:

- https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Buffers/ArrayPool.cs (it's just an abstract class -- you can easily implement your own simple one with your own semantics if you like, using `ConcurrentBag` for storage)
- https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Buffers/TlsOverPerCoreLockedStacksArrayPool.cs -- this is the advanced one used in `ArrayPool.Shared`, and the optimisations in there are interesting reading.
As @rbergen notes above, you can clear the array when you `Return` it, but this is a bad pattern from the `Rent`er's perspective, as you're relying on the behaviour of the `Return`er. The bigger problem for our use case is that you're clearing more than you need to this way. It's more efficient to clear just what you need when you get your rented buffer.
A bit long-winded, I know, but it's an interesting topic and I thought the context would be helpful.
I did write my own C# implementation, but it's not different enough to be worth submitting. Feel free to steal anything you like from there if it's useful: https://github.com/mike-barber/daves-garage-primes/tree/csharp-fast-pooled/PrimeCSharp/solution_4
Useful things:
- 32-bit `uint` is faster than 64-bit `ulong` (found this in my Rust solutions too)
- it's worth unrolling loops manually, as C# doesn't do this. The JIT is in too much of a hurry.
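For anyone who hasn't seen manual unrolling, here's a toy standalone example (the summing loop is purely illustrative, not the sieve's actual inner loop). The unrolled version keeps four independent accumulators so the CPU can overlap the additions, and a scalar tail loop handles lengths that aren't a multiple of four:

```csharp
using System;

class Unroll
{
    // Straightforward loop: one add and one bounds check per element.
    static long SumSimple(int[] a)
    {
        long s = 0;
        for (int i = 0; i < a.Length; i++) s += a[i];
        return s;
    }

    // Manually unrolled 4x with independent accumulators, reducing
    // loop overhead per element; RyuJIT historically does little
    // automatic unrolling, so this can pay off in hot loops.
    static long SumUnrolled(int[] a)
    {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i = 0;
        for (; i + 4 <= a.Length; i += 4)
        {
            s0 += a[i]; s1 += a[i + 1]; s2 += a[i + 2]; s3 += a[i + 3];
        }
        for (; i < a.Length; i++) s0 += a[i]; // remainder
        return s0 + s1 + s2 + s3;
    }

    static void Main()
    {
        var data = new int[1001];
        for (int i = 0; i < data.Length; i++) data[i] = i;
        if (SumSimple(data) != SumUnrolled(data)) throw new Exception("mismatch");
        Console.WriteLine(SumUnrolled(data)); // 500500
    }
}
```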
@mike-barber Did you check using bytes rather than 32/64-bit numbers, as I found that to be quicker?
> @mike-barber Did you check using bytes rather than 32/64 bit numbers as i found that to be quicker.
I did do this previously in my Rust solution, but found 32-bit words to be a little bit faster there. I haven't tried this in C#: I believe `byte` is promoted to a 32-bit `int` for those bitwise shift and mask operations anyway.
My own comparison using GC-allocated arrays gave me results like:
- 8 bit array = 6420 loops
- 32 bit array = 6545 loops
- 64 bit array = 5956 loops
So 32-bit is a smidge faster than 8-bit, and both are notably faster than 64-bit. The performance hit for 64-bit (even with an x64 build target) was rather disappointing.
For the record: on my machine, the changes I've made in the past week increase the number of passes from 2906 for the version currently in the drag-race branch, to 6977 for this version.