LDC AA optimization makes it slower than debug
Tested Compiler Version: LDC 1.39.0
OS: Windows 10
import std.stdio;
import std.datetime.stopwatch;
immutable list = [
"X7vGm1pK",
"dJqL8zWn",
"Yf2RxT0b",
"M9cP5hVs",
"wB3NkZyQ",
"tXg7LJmD",
"H6FpR8Mz",
"Q4wT9YcK",
"Np3dL7JX",
"V0mK8R5X",
"Jq9Y2FpT",
"cL7N3XwB",
"dT8R5MzQ",
"M7Y9pK2X",
"XJq6wL3N",
"R8T5McYQ",
"Vp9K2X7L",
"dJ3wNYXq",
"T8R5Mc7Y",
"KX9V2pJL",
"qJ3NYX7w",
"R8T5Mc9Y",
"LpK2X7VJ",
"wNYXqJ3d",
"8T5Mc9YR",
"K2X7VLpJ",
"YXqJ3wNd",
"5Mc9YR8T",
"X7VLpJ2K",
"qJ3wNdYX",
"Mc9YR8T5",
"7VLpJ2KX",
"J3wNdYXq",
"9YR8T5Mc",
"VLpJ2KX7",
"wNdYXqJ3",
"YR8T5Mc9",
"LpJ2KX7V",
"NdYXqJ3w",
"R8T5Mc9Y",
"pJ2KX7VL",
"dYXqJ3wN",
"8T5Mc9YR",
"J2KX7VLp",
"YXqJ3wNd",
"T5Mc9YR8",
"2KX7VLpJ",
"XqJ3wNdY",
"5Mc9YR8T",
"KX7VLpJ2",
];
void main()
{
    auto res = benchmark!(
        ()
        {
            // Zero-sized value type, so the AA is effectively used as a string set.
            void[0][string] dHash;
            foreach (v; list)
                dHash[v] = void[0].init;
            // Look up every key again; the result of the `in` expression is discarded.
            foreach (v; list)
                v in dHash;
        })(50_000);
    writeln(res);
}
The one built with ldc2 -release -O3 is actually slower than the debug build:
Debug: [68 ms, 861 μs, and 9 hnsecs]
O3: [86 ms, 47 μs, and 9 hnsecs]
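For reference, the two builds can be reproduced with commands along these lines, assuming the snippet above is saved as bench.d and that a plain ldc2 invocation without optimization flags stands in for the "debug" build:
ldc2 bench.d
ldc2 -release -O3 bench.d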
Congratulations, I think you discovered an LLVM bug.
When compiling without optimizations on x86_64, LLVM uses a fast instruction-selection system (FastISel) to generate machine code.
I have reduced your test case to this LLVM frontend IR:
; ModuleID = 'reduced.bc'
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"
%core.time.Duration = type { i64 }
define [1 x %core.time.Duration] @_D3std8datetime9stopwatch__T9benchmarkS_D4test4mainFZ15__lambda_L62_C2FNaNbNfZvZQCaFNbNfkZG1S4core4time8Duration() {
%1 = call ptr @_aaInX()
br label %2
2: ; preds = %0
ret [1 x %core.time.Duration] zeroinitializer
}
; Function Attrs: memory(read)
declare ptr @_aaInX() #0
attributes #0 = { memory(read) }
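A dump like the one below can be obtained with something along these lines, assuming the reduced IR is saved as reduced.ll (the -stop-after pass name may vary between LLVM versions):
llc -O0 -stop-after=finalize-isel reduced.ll -o -
llc -O0 reduced.ll -o -
The first command prints the machine IR right after instruction selection; the second just emits the final assembly, where the missing call is equally visible.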
When compiling with -O0 (no optimizations), watch how LLVM's x86 FastISel "forgets" that there is a function call to _aaInX:
body: |
bb.0 (%ir-block.0):
successors: %bb.1(0x80000000)
bb.1 (%ir-block.2):
%0:gr32 = MOV32r0 implicit-def dead $eflags
%1:gr64 = SUBREG_TO_REG 0, killed %0, %subreg.sub_32bit
$rax = COPY %1
RET 0, $rax
Because the -O0 build never calls _aaInX at all, it is obviously much faster.
I will re-arrange the findings and submit a bug report to LLVM upstream later.
So, basically, the bug is actually in the debug code?
Is this perhaps a valid optimisation given the memory(read) annotation? I'm not sure which side effects would still be allowed under that attribute in recent LLVM.
Hmm, good point (and thanks for investigating, liushuyu!) - in that case I'd expect the call to be optimized away with optimizations enabled, but not with -O0.
clang v19.1.0 also elides calls to __attribute__((pure)) functions with -O0 (when not using the return value obviously): https://cpp.godbolt.org/z/r8haxhboY
It's still elided with -O2 and -O3 though.
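For context, the pattern being discussed is roughly this (a minimal sketch, not necessarily identical to the code behind the Compiler Explorer link):
__attribute__((pure)) int f(void);

void g(void)
{
    // The return value is unused; per the observation above,
    // clang removes this call even at -O0.
    f();
}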
In our case, if you set the target to arm64 or any other non-x86 target, -O0 will produce code that still contains the function call. The issue is only present when the target is x86.
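That can be double-checked with the same reduced IR by overriding the target triple, assuming your llc build includes the AArch64 backend:
llc -O0 -mtriple=aarch64-linux-gnu reduced.ll -o -
llc -O0 -mtriple=x86_64-pc-linux-gnu reduced.ll -o -
The first invocation still emits a call to _aaInX; the second one does not.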