Code compiles for Zen, but always returns 0
I am compiling the same code for Skylake and Ryzen, but when targeting Ryzen the return value is always zero and the runtime is practically zero as well.
Use this to reproduce:
#!/usr/bin/env python3
import ctypes
import llvmlite.binding as llvm
llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()
llvm.initialize_native_asmparser()
code = """
define i64 @"test"(i64 %"N")
{
entry:
%"loop_cond" = icmp slt i64 0, %"N"
br i1 %"loop_cond", label %"loop", label %"end"
loop:
%"loop_counter" = phi i64 [0, %"entry"], [%"loop_counter.1", %"loop"]
%"in.0" = phi i32 [3, %"entry"], [%"out.0", %"loop"]
%"reg.0" = call i32 asm "add $2, $0", "=r,0,i" (i32 %"in.0", i32 1)
%"out.0" = call i32 asm "add $2, $0", "=r,0,i" (i32 %"reg.0", i32 1)
%"loop_counter.1" = add i64 %"loop_counter", 1
%"loop_cond.1" = icmp slt i64 %"loop_counter.1", %"N"
br i1 %"loop_cond.1", label %"loop", label %"end"
end:
%"ret" = phi i64 [0, %"entry"], [%"loop_counter", %"loop"]
ret i64 %"ret"
}
"""
features = llvm.get_host_cpu_features().flatten()
# znver1 on naples and skylake-avx512 on skylake-sp
for cpu in ["skylake-avx512", "znver1"]:
    tm = llvm.Target.from_default_triple().create_target_machine(
        cpu=cpu, opt=3)
    tm.set_asm_verbosity(0)
    module = llvm.parse_assembly(code)
    asm = tm.emit_assembly(module)
    print(asm)
    with llvm.create_mcjit_compiler(module, tm) as ee:
        ee.finalize_object()
        cfptr = ee.get_function_address('test')
        cfunc = ctypes.CFUNCTYPE(ctypes.c_int64, ctypes.c_int64)(cfptr)
        print('->', cfunc(100000))
My output looks like this (skylake-avx512 first, then znver1):
.text
.file "<string>"
.globl test
.p2align 4, 0x90
.type test,@function
test:
.cfi_startproc
testq %rdi, %rdi
jle .LBB0_1
movl $3, %ecx
movq $-1, %rdx
.p2align 4, 0x90
.LBB0_3:
#APP
addl $1, %ecx
#NO_APP
#APP
addl $1, %ecx
#NO_APP
leaq 1(%rdx), %rax
addq $2, %rdx
cmpq %rdi, %rdx
movq %rax, %rdx
jl .LBB0_3
retq
.LBB0_1:
xorl %eax, %eax
retq
.Lfunc_end0:
.size test, .Lfunc_end0-test
.cfi_endproc
.section ".note.GNU-stack","",@progbits
-> 99999
.text
.file "<string>"
.globl test
.p2align 4, 0x90
.type test,@function
test:
.cfi_startproc
testq %rdi, %rdi
jle .LBB0_1
movl $3, %ecx
movq $-1, %rdx
.p2align 4, 0x90
.LBB0_3:
#APP
addl $1, %ecx
#NO_APP
leaq 1(%rdx), %rax
addq $2, %rdx
cmpq %rdi, %rdx
movq %rax, %rdx
#APP
addl $1, %ecx
#NO_APP
jl .LBB0_3
retq
.LBB0_1:
xorl %eax, %eax
retq
.Lfunc_end0:
.size test, .Lfunc_end0-test
.cfi_endproc
.section ".note.GNU-stack","",@progbits
-> 0
I would expect the last -> 0 to be -> 99999.
I am using the latest llvmlite PyPI release (0.24.0); however, it is not clear to me how to find out which LLVM version it is linked against.
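For reference, a minimal sketch of one way to query this from Python. It assumes the installed llvmlite exposes `llvmlite.binding.llvm_version_info` as a (major, minor, patch) tuple; the helper names `format_llvm_version` and `linked_llvm_version` are made up for illustration.

```python
def format_llvm_version(info):
    """Turn a version tuple such as (6, 0, 0) into the string '6.0.0'."""
    return ".".join(str(part) for part in info)


def linked_llvm_version():
    """Return the LLVM version llvmlite reports, or None if llvmlite is missing."""
    try:
        import llvmlite.binding as llvm
    except ImportError:
        return None
    # Assumption: llvm_version_info is a (major, minor, patch) tuple.
    return format_llvm_version(llvm.llvm_version_info)


if __name__ == "__main__":
    print("llvmlite is linked against LLVM", linked_llvm_version())
```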
That release is linked to LLVM 6.0.0. Are there any known Ryzen bugs with LLVM 6? I know 6.0.1 was released a few weeks ago...
None I am aware of from reading through the high level changelog.
It is unclear from your report which hardware you are running on for each test. Are you running both tests on a Ryzen CPU or both on a Skylake CPU?
(Either way, I would not expect the x86 code for either target to generate the wrong answer, of course.)
Actually, never mind, I see in the comments you are running on Skylake.
Does turning down the optimisation level "fix" it?
@seibert running on either Skylake or Ryzen, the code fails as long as the cpu target is set to znver1 and works correctly with skylake-avx512 (the assembly looks the same).
@stuartarchibald yes, optimization levels 1 and 0 work fine, while 2 and 3 fail. The only difference in assembly between -O1 and -O2/3 is that the second addl is moved to the end of the loop with 2 and 3, while it is in original order of the IR with -O1.
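The position of the second addl can matter because `add` writes the x86 status flags, so an `addl` that lands between `cmpq` and `jl` changes what the branch sees. The following is a toy Python model of the two loop bodies printed in the original post, not something verified against LLVM itself. It tracks only the sign and overflow flags, the names `wrap` and `run_loop` are invented for illustration, and it rests on the assumption that the scheduler may move the inline asm because its constraint string `"=r,0,i"` declares no flags clobber.

```python
def wrap(exact, bits):
    """Wrap an exact result to a signed `bits`-wide value; return (value, SF, OF)."""
    lo = -(1 << (bits - 1))
    value = (exact - lo) % (1 << bits) + lo
    return value, value < 0, value != exact


def run_loop(n, add_after_cmp):
    """Emulate the emitted loop for either placement of the second addl."""
    if n <= 0:                                   # testq %rdi, %rdi / jle .LBB0_1
        return 0
    ecx, rdx = 3, -1
    while True:
        ecx, sf, of = wrap(ecx + 1, 32)          # first addl $1, %ecx
        if not add_after_cmp:
            ecx, sf, of = wrap(ecx + 1, 32)      # second addl, before the compare
        rax = rdx + 1                            # leaq 1(%rdx), %rax
        rdx += 2                                 # addq $2, %rdx
        _, sf, of = wrap(rdx - n, 64)            # cmpq %rdi, %rdx sets flags from rdx - rdi
        rdx = rax                                # movq does not touch the flags
        if add_after_cmp:
            ecx, sf, of = wrap(ecx + 1, 32)      # second addl overwrites SF and OF
        if sf == of:                             # jl is taken only when SF != OF
            return rax


print(run_loop(100000, add_after_cmp=False))     # skylake-avx512 order -> 99999
print(run_loop(100000, add_after_cmp=True))      # znver1 order -> 0
```

If this model matches what happens on real hardware, one thing to try would be declaring the clobber in the IR, for example `"=r,0,i,~{flags}"`; that is a suggestion only and has not been tested in this thread.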
Hmm, strange. Do the command-line tools behave the same way (just trying to rule out llvmlite)?
@stuartarchibald I am currently unable to reproduce it using clang, but I am also not sure whether -mtune=znver1 is equivalent to the cpu string mentioned above.
I used the following code:
#include <stdio.h>
/* test is defined in fail.ll as i64 @test(i64), so declare it with 64-bit types */
long long test(long long);
int main() {
    printf("%lld\n", test(100000));
    return 0;
}
and
define i64 @"test"(i64 %"N")
{
entry:
%"loop_cond" = icmp slt i64 0, %"N"
br i1 %"loop_cond", label %"loop", label %"end"
loop:
%"loop_counter" = phi i64 [0, %"entry"], [%"loop_counter.1", %"loop"]
%"in.0" = phi i32 [3, %"entry"], [%"out.0", %"loop"]
%"reg.0" = call i32 asm "add $2, $0", "=r,0,i" (i32 %"in.0", i32 1)
%"out.0" = call i32 asm "add $2, $0", "=r,0,i" (i32 %"reg.0", i32 1)
%"loop_counter.1" = add i64 %"loop_counter", 1
%"loop_cond.1" = icmp slt i64 %"loop_counter.1", %"N"
br i1 %"loop_cond.1", label %"loop", label %"end"
end:
%"ret" = phi i64 [0, %"entry"], [%"loop_counter", %"loop"]
ret i64 %"ret"
}
and compiled with clang-mp-6.0 -mtune=znver1 main.c fail.ll
Does:
$ OPT="-O3" MTRIPLE="x86_64-unknown-unknown" MCPU="zenver1" opt -verify ${OPT} -mtriple=${MTRIPLE} -mcpu=${MCPU} fail.ll|llc ${OPT} -mtriple=${MTRIPLE} -mcpu=${MCPU}
help ?
That prints the assembly which I get when compiling for skylake (and works).
I also realized that in the original posting, the two assembly listings differ in the same way as with -O1 vs -O2/3. I must have overlooked that.
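Differences like this are easy to miss by eye. Below is a small sketch of comparing two listings with Python's standard `difflib`. The two lists are the loop bodies copied from the output in the original post with the `#APP`/`#NO_APP` markers removed, and the variable names are only for illustration.

```python
import difflib

# Loop body emitted for cpu="skylake-avx512"
skylake_loop = [
    "addl $1, %ecx",
    "addl $1, %ecx",
    "leaq 1(%rdx), %rax",
    "addq $2, %rdx",
    "cmpq %rdi, %rdx",
    "movq %rax, %rdx",
    "jl .LBB0_3",
]

# Loop body emitted for cpu="znver1"
znver1_loop = [
    "addl $1, %ecx",
    "leaq 1(%rdx), %rax",
    "addq $2, %rdx",
    "cmpq %rdi, %rdx",
    "movq %rax, %rdx",
    "addl $1, %ecx",
    "jl .LBB0_3",
]

# Lines starting with "-" exist only in the first listing, "+" only in the second.
diff = list(difflib.unified_diff(skylake_loop, znver1_loop,
                                 fromfile="skylake-avx512", tofile="znver1",
                                 lineterm=""))
print("\n".join(diff))
```

In practice the same comparison could be run on the full strings returned by `tm.emit_assembly(module)` after calling `splitlines()` on each.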