closure iterators are much slower with ARC/ORC
What happened?
Here I get an 8x slowdown when I switch to --mm:arc:
import std/sugar
import benchy

proc toIter*[T](s: Slice[T]): iterator: T =
  iterator it: T {.closure.} =
    for x in s.a..s.b:
      yield x
  return it

proc filter*[T](i: iterator: T, f: proc(x: T): bool): iterator: T =
  iterator it: T {.closure.} =
    for x in i():
      if f(x):
        yield x
  result = it

iterator filter*[T](i: iterator: T, f: proc(x: T): bool): T =
  for x in i():
    if f(x):
      yield x

timeIt "closure iterator":
  var acc = 0
  for i in (1..100_000).
      toIter.
      filter(x => x mod 2 == 0).
      filter(x => x mod 4 == 0).
      filter(x => x mod 8 == 0).
      filter(x => x mod 16 == 0).
      filter(x => x mod 32 == 0).
      filter(x => x mod 64 == 0).
      filter(x => x mod 128 == 0).
      filter(x => x mod 256 == 0).
      filter(x => x mod 512 == 0):
    acc.inc i
Nim Version
Nim Compiler Version 1.7.1 [Windows: amd64]
Compiled at 2022-07-17
Copyright (c) 2006-2022 by Andreas Rumpf
active boot switches: -d:release
Current Standard Output Logs
with ARC (nim --mm:arc -d:release r .\play.nim):

  min time   avg time   std dv    runs  name
  4.022 ms   4.352 ms   ±0.207   x1000  closure iterator

with refc (nim -d:release r .\play.nim):

  min time   avg time   std dv    runs  name
  0.841 ms   0.924 ms   ±0.045   x1000  closure iterator
Expected Standard Output Logs
almost the same numbers
Additional Information
The numbers are almost the same with version 1.6.6; I will try to bisect the regression.
This is because devel enabled --threads:on by default. Closure iterators are much slower with --threads:on under ARC/ORC. Use --threads:off as an optimization for now.
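Concretely, the workaround is to add --threads:off to the compile command already used above (the file name play.nim is taken from the commands in the logs):

```shell
# Same ARC build as in the report, but without the threading runtime:
nim r --mm:arc --threads:off -d:release play.nim
```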
I did a profile again: the thread-specific logic increases the time, and mimalloc doesn't help here.
I've seen slightly slower performance in multiple places, even for simple programs. Not as extreme as 8x, but still; I wonder if the --threads:on default should be reconsidered. 🤔
Out of curiosity, what's the underlying cause?
I wish I knew; I suspect a terrible "thread local storage" implementation.
It is not related to ARC/ORC; I can reproduce it on Linux with --mm:none under both --threads:on and --threads:off.
What would be interesting is a comparison between Nim and other (compiled) languages with similar closure semantics/mechanisms. Then we might be able to uncover ways to optimize the currently generated code.
It's not very interesting, Nim always allocates but often enough the closures really do escape so that it cannot be optimized out. Where the closures don't escape idiomatic Nim already uses templates.
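To illustrate the template alternative mentioned above: a minimal sketch using filterIt from std/sequtils, a real stdlib template that inlines the predicate at the call site, so no closure environment is allocated (the example values are illustrative):

```nim
import std/sequtils

# Where the predicate does not need to escape, idiomatic Nim uses a
# template such as `filterIt`: `it` is substituted inline, so no
# closure is heap-allocated, unlike the closure-iterator pipeline.
let evens = toSeq(1..10).filterIt(it mod 2 == 0)
echo evens  # @[2, 4, 6, 8, 10]
```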
It's the thread local storage emulation or something similar.
Hm, how slow are Nim's closures compared to, say, Go?
With --threads:off my timings are 0.571 ms for ARC and 0.569 ms for refc. Close enough.
So using --threads:off is recommended? 🤔
If you don't use threads, yes.
Then maybe --threads:on should be reconsidered as the default. 🤔
No, people should exploit threads in order to get performance instead. shrug