LDC AA optimization makes it slower than debug
Tested Compiler Version: LDC 1.39.0
OS: Windows 10
import std.stdio;
import std.datetime.stopwatch;
immutable list = [
"X7vGm1pK",
"dJqL8zWn",
"Yf2RxT0b",
"M9cP5hVs",
"wB3NkZyQ",
"tXg7LJmD",
"H6FpR8Mz",
"Q4wT9YcK",
"Np3dL7JX",
"V0mK8R5X",
"Jq9Y2FpT",
"cL7N3XwB",
"dT8R5MzQ",
"M7Y9pK2X",
"XJq6wL3N",
"R8T5McYQ",
"Vp9K2X7L",
"dJ3wNYXq",
"T8R5Mc7Y",
"KX9V2pJL",
"qJ3NYX7w",
"R8T5Mc9Y",
"LpK2X7VJ",
"wNYXqJ3d",
"8T5Mc9YR",
"K2X7VLpJ",
"YXqJ3wNd",
"5Mc9YR8T",
"X7VLpJ2K",
"qJ3wNdYX",
"Mc9YR8T5",
"7VLpJ2KX",
"J3wNdYXq",
"9YR8T5Mc",
"VLpJ2KX7",
"wNdYXqJ3",
"YR8T5Mc9",
"LpJ2KX7V",
"NdYXqJ3w",
"R8T5Mc9Y",
"pJ2KX7VL",
"dYXqJ3wN",
"8T5Mc9YR",
"J2KX7VLp",
"YXqJ3wNd",
"T5Mc9YR8",
"2KX7VLpJ",
"XqJ3wNdY",
"5Mc9YR8T",
"KX7VLpJ2",
];
void main()
{
    auto res = benchmark!(
        ()
        {
            // Zero-sized value type, so the AA is effectively used as a string set.
            void[0][string] dHash;
            foreach (v; list)
                dHash[v] = void[0].init;
            // Look up every key again; the result of the `in` expression is discarded.
            foreach (v; list)
                v in dHash;
        })(50_000);
    writeln(res);
}
The one built with ldc2 -release -O3 is actually slower than the debug build:
Debug: [68 ms, 861 μs, and 9 hnsecs]
O3: [86 ms, 47 μs, and 9 hnsecs]
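For reference, the two builds can be reproduced with commands along these lines, assuming the snippet above is saved as bench.d and that a plain ldc2 invocation without optimization flags stands in for the "debug" build:
ldc2 bench.d
ldc2 -release -O3 bench.d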
Congratulations, I think you discovered an LLVM bug.
When compiling without optimizations on x86_64, LLVM uses a fast instruction-selection system (FastISel) to generate machine code.
I have reduced your test case to this LLVM frontend IR:
; ModuleID = 'reduced.bc'
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"
%core.time.Duration = type { i64 }
define [1 x %core.time.Duration] @_D3std8datetime9stopwatch__T9benchmarkS_D4test4mainFZ15__lambda_L62_C2FNaNbNfZvZQCaFNbNfkZG1S4core4time8Duration() {
%1 = call ptr @_aaInX()
br label %2
2: ; preds = %0
ret [1 x %core.time.Duration] zeroinitializer
}
; Function Attrs: memory(read)
declare ptr @_aaInX() #0
attributes #0 = { memory(read) }
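A dump like the one below can be obtained with something along these lines, assuming the reduced IR is saved as reduced.ll (the -stop-after pass name may vary between LLVM versions):
llc -O0 -stop-after=finalize-isel reduced.ll -o -
llc -O0 reduced.ll -o -
The first command prints the machine IR right after instruction selection; the second just emits the final assembly, where the missing call is equally visible.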
When compiling with -O0 (no optimizations), watch how LLVM's x86 FastISel "forgets" that there is a function call to _aaInX:
body: |
bb.0 (%ir-block.0):
successors: %bb.1(0x80000000)
bb.1 (%ir-block.2):
%0:gr32 = MOV32r0 implicit-def dead $eflags
%1:gr64 = SUBREG_TO_REG 0, killed %0, %subreg.sub_32bit
$rax = COPY %1
RET 0, $rax
Because the -O0 build never calls _aaInX at all, it is obviously much faster.
I will re-arrange the findings and submit a bug report to LLVM upstream later.
So, basically, the bug is actually in the debug code?
Is this perhaps a valid optimisation given the memory(read) annotation? I'm not sure which side effects would still be allowed under that attribute in recent LLVM.
Hmm, good point (and thanks for investigating, liushuyu!) - in that case I'd expect the call to be optimized away with optimizations enabled, but not with -O0.
clang v19.1.0 also elides calls to __attribute__((pure)) functions with -O0 (when not using the return value obviously): https://cpp.godbolt.org/z/r8haxhboY
It's still elided with -O2 and -O3 though.
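For context, the pattern being discussed is roughly this (a minimal sketch, not necessarily identical to the code behind the Compiler Explorer link):
__attribute__((pure)) int f(void);

void g(void)
{
    // The return value is unused; per the observation above,
    // clang removes this call even at -O0.
    f();
}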
In our case, if you set the target to arm64 or any other non-x86 target, -O0 will produce code that still contains the function call. The issue is only present when the target is x86.
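That can be double-checked with the same reduced IR by overriding the target triple, assuming your llc build includes the AArch64 backend:
llc -O0 -mtriple=aarch64-linux-gnu reduced.ll -o -
llc -O0 -mtriple=x86_64-pc-linux-gnu reduced.ll -o -
The first invocation still emits a call to _aaInX; the second one does not.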