BoundedAsyncQueue performs poorly on Scala Native
Some benchmarking by @lbialy (https://github.com/lbialy/ce-jvm-vs-sn) shows that a CE application using a single-producer/single-consumer queue performs much worse on Scala Native 0.5.8 than on the JVM:
---------------------------------
mode          lto   gc     ms
---------------------------------
release-full  full  immix  4846
jvm           n/a   n/a    283
This performance difference is erased if the queue capacity is set such that CE uses the concurrent queue implementation instead:
---------------------------------
mode          lto   gc     ms
---------------------------------
release-full  full  immix  457
jvm           n/a   n/a    514
Profiling on macOS shows a significant amount of time being spent on exception handling in the async queue's notifyOne implementation.
So one conclusion here seems to be that exceptions are fairly expensive on Scala Native in a way that they aren't on the JVM. More specifically, on the JVM, throwing and catching an exception is very cheap, but generating a stack trace is quite expensive. It seems that on SN, both are pricey. This is a problematic corner case because a huge amount of performance-sensitive code written for the JVM assumes that throw/catch is almost free, and the high-performance async queue is one such piece of code. (In this case, the assumption allows us to implement a terminal state for the queue without relying on null or other sentinel values in-band.)
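To make the technique concrete, here is a minimal sketch of signalling "nothing left" with a preallocated, stackless exception instead of an in-band sentinel. This is not the cats-effect implementation: `Ring`, `Drained`, and `drain` are hypothetical names, and the buffer is single-threaded for brevity. The only library piece it relies on is `scala.util.control.NoStackTrace`.

```scala
import scala.util.control.NoStackTrace

// Hypothetical names throughout; an illustration, not the cats-effect queue.
object TerminalStateSketch {
  // Preallocated singleton: throwing it allocates nothing and, by default,
  // captures no stack trace.
  object Drained extends RuntimeException("drained") with NoStackTrace

  // Single-threaded ring buffer. Every value of A, including null, is a
  // legitimate element, so no value is left over to mean "empty".
  final class Ring[A](capacity: Int) {
    private val buf = new Array[Any](capacity)
    private var head = 0
    private var size = 0

    // Returns false when the buffer is full.
    def offer(a: A): Boolean =
      if (size == capacity) false
      else {
        buf((head + size) % capacity) = a
        size += 1
        true
      }

    // Returns the next element, or throws Drained when there is none.
    def take(): A = {
      if (size == 0) throw Drained
      val a = buf(head).asInstanceOf[A]
      buf(head) = null
      head = (head + 1) % capacity
      size -= 1
      a
    }
  }

  // Drains the ring, using the exception as the only "no more elements" signal.
  def drain[A](ring: Ring[A]): List[A] = {
    val out = List.newBuilder[A]
    try {
      while (true) out += ring.take()
    } catch {
      case Drained => ()
    }
    out.result()
  }
}
```

With this shape, the hot path returns a bare `A` with no wrapper allocation, and the cost of the terminal case is exactly one throw/catch, which is the part that appears to be expensive on Scala Native.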
As a temporary workaround, we can just disable the high performance queue on native (and we probably will tbh), but I would suggest that this is probably worth looking into on the Scala Native side. @WojciechMazur for viz.
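For anyone looking into this on the Scala Native side, a crude micro-benchmark along these lines might help isolate the raw throw/catch cost from the queue itself. It is only a sketch: the object and method names are made up, the iteration count is arbitrary, and it uses plain wall-clock timing rather than a proper harness, so treat the numbers as rough indications only. Note also that a JIT may optimize a local throw/catch heavily, which is part of what is being compared.

```scala
import scala.util.control.NoStackTrace

// Hypothetical micro-benchmark; names and iteration count are arbitrary.
object ThrowCatchBench {
  // Preallocated exception that captures no stack trace by default.
  object Signal extends RuntimeException("signal") with NoStackTrace

  // Throws and catches the preallocated, stackless exception n times.
  def stackless(n: Int): Int = {
    var caught = 0
    var i = 0
    while (i < n) {
      try throw Signal
      catch { case Signal => caught += 1 }
      i += 1
    }
    caught
  }

  // Allocates a fresh exception on every iteration, which also captures
  // a stack trace.
  def withTrace(n: Int): Int = {
    var caught = 0
    var i = 0
    while (i < n) {
      try throw new RuntimeException("signal")
      catch { case _: RuntimeException => caught += 1 }
      i += 1
    }
    caught
  }

  // Wall-clock time of a single run in milliseconds, plus the run's result.
  def timeMs(body: => Int): (Int, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    val n = 1000000
    // Warm-up pass; mostly relevant on the JVM.
    stackless(n)
    withTrace(n)
    val (_, stacklessMs) = timeMs(stackless(n))
    val (_, withTraceMs) = timeMs(withTrace(n))
    println(s"stackless throw/catch:  $stacklessMs ms")
    println(s"throw with stack trace: $withTraceMs ms")
  }
}
```

If the observation above holds, the two numbers should be far apart on the JVM and much closer together (and both high) on Native.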
Edit: As an aside, one notable example of a user of this technique is ZIO. I haven't looked at their fiber interpreter, but from what I gleaned based on conversations when it was first released, it sounded to me like the new runloop in 2.0 handled suspension and trampoline states by throwing a sentinel exception, allowing their runloop to more aggressively leverage the underlying stack. On the JVM, this is a fairly reasonable implementation, but exception performance on Native suggests that this technique would be incredibly slow on that compilation target.
For reference: I ran this over a large matrix of configurations on an Apple M1 MacBook Pro and on a Ryzen 7 2700X running Arch Linux:
https://gist.github.com/lbialy/202901d3ec29d2d245103df1068eb945
The code in the repo mentioned by the OP can handle different scenarios with a small amount of tinkering.
For visibility: @WojciechMazur