cats-effect BoundedAsyncQueue performs poorly on Scala Native

Some benchmarking by @lbialy (https://github.com/lbialy/ce-jvm-vs-sn) shows that a CE application on Scala Native 0.5.8 using a single-producer/single-consumer queue performs much worse than the same application on the JVM:

---------------------------------
mode         lto   gc          ms
---------------------------------
release-full full  immix     4846
jvm          n/a   n/a        283

This performance difference is erased if the queue capacity is set such that CE uses the concurrent queue implementation instead:

---------------------------------
mode         lto   gc          ms
---------------------------------
release-full full  immix      457
jvm          n/a   n/a        514

Profiling on macOS shows significant time being spent on exception handling in the async queue's notifyOne implementation.

Aug 03 '25 15:08 reardonj

So one conclusion here seems to be that exceptions are fairly expensive on Scala Native in a way that they aren't on the JVM. More specifically, on the JVM, throwing and catching exceptions is very cheap but generating a stack trace is quite expensive. It seems that on SN, both are pricey. This is a problematic corner case because a huge amount of performance-sensitive code written for the JVM assumes that throw/catch is almost free, and the high-performance async queue is one such case. (Here, the assumption allows us to implement a terminal state for the queue without relying on null or other sentinel values in-band.)
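
To make the technique concrete, here is a minimal sketch of signalling a terminal state out-of-band with a preallocated, stack-trace-free exception. This is not the actual BoundedAsyncQueue code: `Ring`, `QueueEmpty`, and `drain` are hypothetical names, and the buffer is single-threaded for brevity. The only library piece it relies on is `scala.util.control.NoStackTrace`.

```scala
import scala.util.control.NoStackTrace

object TerminalSignalSketch {
  // Hypothetical signal: allocated once, and NoStackTrace skips stack trace
  // capture, so on the JVM only the throw/catch itself is paid for.
  object QueueEmpty extends RuntimeException("queue empty") with NoStackTrace

  // Hypothetical single-threaded ring buffer, for illustration only.
  final class Ring[A](capacity: Int) {
    private val buf = new Array[Any](capacity)
    private var head = 0
    private var size = 0

    // Returns false when the buffer is full.
    def offer(a: A): Boolean =
      if (size == capacity) false
      else {
        buf((head + size) % capacity) = a
        size += 1
        true
      }

    // Throws QueueEmpty instead of returning a sentinel, so every value of A
    // (including null) remains a legal element.
    def take(): A =
      if (size == 0) throw QueueEmpty
      else {
        val a = buf(head).asInstanceOf[A]
        buf(head) = null
        head = (head + 1) % capacity
        size -= 1
        a
      }
  }

  // Drains the buffer, using the exception as the loop's exit condition.
  def drain[A](q: Ring[A]): List[A] = {
    val out = List.newBuilder[A]
    try {
      while (true) out += q.take()
    } catch {
      case QueueEmpty => ()
    }
    out.result()
  }
}
```

The trade-off is that each "empty" outcome costs one throw/catch. That is the part that is assumed to be nearly free on the JVM and appears not to be on SN.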

As a temporary workaround, we can just disable the high performance queue on native (and we probably will tbh), but I would suggest that this is probably worth looking into on the Scala Native side. @WojciechMazur for viz.

Edit: As an aside, one notable example of a user of this technique is ZIO. I haven't looked at their fiber interpreter, but from what I gleaned based on conversations when it was first released, it sounded to me like the new runloop in 2.0 handled suspension and trampoline states by throwing a sentinel exception, allowing their runloop to more aggressively leverage the underlying stack. On the JVM, this is a fairly reasonable implementation, but exception performance on Native suggests that this technique would be incredibly slow on that compilation target.
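
For illustration only, this is roughly what trampolining with a sentinel exception can look like. It is not ZIO's code (which, as noted, I haven't read); `Bounce`, `MaxDepth`, and the even/odd functions are hypothetical and just stand in for a runloop that recurses on the real stack and unwinds to a driver loop once it gets too deep.

```scala
import scala.util.control.NoStackTrace

object TrampolineSketch {
  // Hypothetical sentinel carrying the state to resume from.
  // NoStackTrace skips stack trace capture when it is constructed.
  final class Bounce(val n: Int, val wantEven: Boolean)
      extends RuntimeException
      with NoStackTrace

  // Arbitrary depth budget before unwinding back to the driver loop.
  private val MaxDepth = 256

  // Mutually recursive, so the calls consume real stack frames.
  private def even(n: Int, depth: Int): Boolean =
    if (n == 0) true
    else if (depth >= MaxDepth) throw new Bounce(n, wantEven = true)
    else odd(n - 1, depth + 1)

  private def odd(n: Int, depth: Int): Boolean =
    if (n == 0) false
    else if (depth >= MaxDepth) throw new Bounce(n, wantEven = false)
    else even(n - 1, depth + 1)

  // Driver loop: catches the sentinel and resumes on a fresh, shallow stack.
  def isEven(n0: Int): Boolean = {
    require(n0 >= 0, "sketch assumes a non-negative input")
    var n = n0
    var wantEven = true
    var result = false
    var done = false
    while (!done) {
      try {
        result = if (wantEven) even(n, 0) else odd(n, 0)
        done = true
      } catch {
        case b: Bounce =>
          n = b.n
          wantEven = b.wantEven
      }
    }
    result
  }
}
```

With a budget of 256 frames, `isEven(1000000)` throws and catches a few thousand times. If each throw/catch is expensive, as the profile above suggests for Native, a design like this pays that cost on every bounce.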

Aug 03 '25 15:08 djspiewak

For reference: I ran this over a large matrix on an Apple M1 MBP and on a Ryzen 7 2700X running Arch Linux:

https://gist.github.com/lbialy/202901d3ec29d2d245103df1068eb945

The code in the repo mentioned by the OP can handle different scenarios with a small amount of tinkering.

For visibility: @WojciechMazur

Aug 03 '25 20:08 lbialy