llvmlite code compiles for Zen, but always returns 0

I am compiling the same code for Skylake and Ryzen, but with the Ryzen target the return value is always zero and the runtime is practically zero as well.

Use this to reproduce:

#!/usr/bin/env python3
import ctypes

import llvmlite.binding as llvm


llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()
llvm.initialize_native_asmparser()

code = """
define i64 @"test"(i64 %"N")
{
entry:
  %"loop_cond" = icmp slt i64 0, %"N"
  br i1 %"loop_cond", label %"loop", label %"end"

loop:
  %"loop_counter" = phi i64 [0, %"entry"], [%"loop_counter.1", %"loop"]
  %"in.0" = phi i32 [3, %"entry"], [%"out.0", %"loop"]


  %"reg.0" = call i32 asm  "add $2, $0", "=r,0,i" (i32 %"in.0", i32 1)
  %"out.0" = call i32 asm  "add $2, $0", "=r,0,i" (i32 %"reg.0", i32 1)
  %"loop_counter.1" = add i64 %"loop_counter", 1
  %"loop_cond.1" = icmp slt i64 %"loop_counter.1", %"N"
  br i1 %"loop_cond.1", label %"loop", label %"end"

end:
  %"ret" = phi i64 [0, %"entry"], [%"loop_counter", %"loop"]

  ret i64 %"ret"
}
"""

features = llvm.get_host_cpu_features().flatten()
# znver1 on naples and skylake-avx512 on skylake-sp
for cpu in ["skylake-avx512", "znver1"]:
    tm = llvm.Target.from_default_triple().create_target_machine(
        cpu=cpu, opt=3)
    tm.set_asm_verbosity(0)

    module = llvm.parse_assembly(code)
    asm = tm.emit_assembly(module)
    print(asm)
    with llvm.create_mcjit_compiler(module, tm) as ee:
        ee.finalize_object()
        cfptr = ee.get_function_address('test')
        cfunc = ctypes.CFUNCTYPE(ctypes.c_int64, ctypes.c_int64)(cfptr)
        print('->', cfunc(100000))

My output looks like this:

        .text
        .file   "<string>"
        .globl  test
        .p2align        4, 0x90
        .type   test,@function
test:
        .cfi_startproc
        testq   %rdi, %rdi
        jle     .LBB0_1
        movl    $3, %ecx
        movq    $-1, %rdx
        .p2align        4, 0x90
.LBB0_3:
        #APP
        addl    $1, %ecx
        #NO_APP
        #APP
        addl    $1, %ecx
        #NO_APP
        leaq    1(%rdx), %rax
        addq    $2, %rdx
        cmpq    %rdi, %rdx
        movq    %rax, %rdx
        jl      .LBB0_3
        retq
.LBB0_1:
        xorl    %eax, %eax
        retq
.Lfunc_end0:
        .size   test, .Lfunc_end0-test
        .cfi_endproc


        .section        ".note.GNU-stack","",@progbits

-> 99999
        .text
        .file   "<string>"
        .globl  test
        .p2align        4, 0x90
        .type   test,@function
test:
        .cfi_startproc
        testq   %rdi, %rdi
        jle     .LBB0_1
        movl    $3, %ecx
        movq    $-1, %rdx
        .p2align        4, 0x90
.LBB0_3:
        #APP
        addl    $1, %ecx
        #NO_APP
        leaq    1(%rdx), %rax
        addq    $2, %rdx
        cmpq    %rdi, %rdx
        movq    %rax, %rdx
        #APP
        addl    $1, %ecx
        #NO_APP
        jl      .LBB0_3
        retq
.LBB0_1:
        xorl    %eax, %eax
        retq
.Lfunc_end0:
        .size   test, .Lfunc_end0-test
        .cfi_endproc


        .section        ".note.GNU-stack","",@progbits

-> 0

I would expect the last -> 0 to be -> 99999.

I am using the latest llvmlite PyPI release (0.24.0); however, it is not clear how to find out which LLVM version it is linked against.

Jul 27 '18 14:07 cod3monk

That release is linked to LLVM 6.0.0. Are there any known Ryzen bugs with LLVM 6? I know 6.0.1 was released a few weeks ago...

Jul 27 '18 14:07 seibert
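
The linked LLVM version can also be read at runtime. The following is a minimal sketch, assuming the installed llvmlite exposes `llvm_version_info` in `llvmlite.binding`; the helper name `linked_llvm_version` is made up for illustration and is not part of llvmlite.

```python
def linked_llvm_version():
    """Return (llvmlite version, LLVM version) or None if llvmlite is missing.

    `linked_llvm_version` is a hypothetical helper name; it only wraps
    attributes that llvmlite itself provides.
    """
    try:
        import llvmlite
        import llvmlite.binding as llvm
    except ImportError:
        return None
    # llvm_version_info is a tuple of integers such as (major, minor, patch).
    llvm_version = ".".join(str(part) for part in llvm.llvm_version_info)
    return llvmlite.__version__, llvm_version


print(linked_llvm_version())
```

Including both version strings in a bug report makes it easier to tell whether a problem comes from llvmlite or from the LLVM release underneath it.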

None that I am aware of from reading through the high-level changelog.

Jul 27 '18 14:07 cod3monk

It is unclear from your report which hardware you are running on for each test. Are you running both tests on a Ryzen CPU or both on a Skylake CPU?

Jul 27 '18 14:07 seibert

(Either way, I would not expect the x86 code for either target to generate the wrong answer, of course.)

Jul 27 '18 14:07 seibert

Actually, never mind, I see in the comments you are running on Skylake.

Jul 27 '18 14:07 seibert

Does turning down the optimisation level "fix" it?

Jul 27 '18 14:07 stuartarchibald

@seibert running on either Skylake or Ryzen, the code fails as long as the CPU target is set to znver1 and works correctly with skylake-avx512 (the assembly looks the same).

@stuartarchibald yes, optimization levels 0 and 1 work fine, while 2 and 3 fail. The only difference in the assembly between -O1 and -O2/-O3 is that at -O2/-O3 the second addl is moved to the end of the loop, while at -O1 it stays in the original order of the IR.

Jul 27 '18 15:07 cod3monk
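
To see why the position of the second addl matters, the two listings from the original post can be modelled in a few lines of Python. This is only a sketch of one possible reading: on x86, `add` writes the status flags and `jl` branches when SF != OF, so an addl placed between cmpq and jl replaces the flags that the comparison produced. The function names `addl_1` and `run` are invented for this illustration, and only the registers and flags needed for the branch are modelled.

```python
def addl_1(ecx):
    """Model `addl $1, %ecx`: return (new ecx, whether a following jl is taken)."""
    result = (ecx + 1) & 0xFFFFFFFF
    sign_flag = result >> 31                      # SF: top bit of the 32-bit result
    overflow_flag = 1 if ecx == 0x7FFFFFFF else 0 # OF: signed overflow on +1
    return result, sign_flag != overflow_flag     # jl is taken when SF != OF


def run(n, asm_before_branch):
    """Simulate the loop; asm_before_branch=True mimics the znver1 listing."""
    if n <= 0:                  # testq %rdi, %rdi ; jle .LBB0_1
        return 0
    ecx, rdx = 3, -1
    while True:
        ecx, _ = addl_1(ecx)
        if not asm_before_branch:
            ecx, _ = addl_1(ecx)        # skylake-avx512 listing: both adds first
        rax = rdx + 1                   # leaq 1(%rdx), %rax (no flags written)
        rdx += 2                        # addq $2, %rdx
        taken = rdx < n                 # cmpq %rdi, %rdx sets flags for jl
        rdx = rax                       # movq %rax, %rdx (no flags written)
        if asm_before_branch:
            ecx, taken = addl_1(ecx)    # znver1 listing: overwrites cmpq's flags
        if not taken:
            return rax                  # retq with the loop counter in %rax


print(run(100000, asm_before_branch=False))
print(run(100000, asm_before_branch=True))
```

In this model the first call counts up to 99999, while the second leaves the loop after one iteration with 0 in the return register, which matches both the wrong result and the near-zero runtime reported above.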

Hmm, strange. Do the command-line tools do the same (just trying to eliminate llvmlite)?

Jul 27 '18 15:07 stuartarchibald

@stuartarchibald I am currently unable to reproduce it using clang, but I am also not sure whether -mtune=znver1 is equivalent to the CPU string mentioned above.

I used the following code:

#include <stdio.h>
/* The IR defines i64 @test(i64), so declare a 64-bit type here. */
long long test(long long);
int main() {
    printf("%lld\n", test(100000));
    return 0;
}

and

define i64 @"test"(i64 %"N")
{
entry:
  %"loop_cond" = icmp slt i64 0, %"N"
  br i1 %"loop_cond", label %"loop", label %"end"

loop:
  %"loop_counter" = phi i64 [0, %"entry"], [%"loop_counter.1", %"loop"]
  %"in.0" = phi i32 [3, %"entry"], [%"out.0", %"loop"]


  %"reg.0" = call i32 asm  "add $2, $0", "=r,0,i" (i32 %"in.0", i32 1)
  %"out.0" = call i32 asm  "add $2, $0", "=r,0,i" (i32 %"reg.0", i32 1)
  %"loop_counter.1" = add i64 %"loop_counter", 1
  %"loop_cond.1" = icmp slt i64 %"loop_counter.1", %"N"
  br i1 %"loop_cond.1", label %"loop", label %"end"

end:
  %"ret" = phi i64 [0, %"entry"], [%"loop_counter", %"loop"]

  ret i64 %"ret"
}

and compiled with clang-mp-6.0 -mtune=znver1 main.c fail.ll

Jul 27 '18 15:07 cod3monk

Does:

$ OPT="-O3"; MTRIPLE="x86_64-unknown-unknown"; MCPU="znver1"; opt -verify ${OPT} -mtriple=${MTRIPLE} -mcpu=${MCPU} fail.ll | llc ${OPT} -mtriple=${MTRIPLE} -mcpu=${MCPU}

help?

Jul 27 '18 16:07 stuartarchibald

That prints the assembly which I get when compiling for skylake (and works).

Jul 27 '18 16:07 cod3monk

I also realized that in the original post, the two assembly listings differ in the same way as -O1 vs. -O2/-O3. I must have overlooked that.

Jul 27 '18 16:07 cod3monk
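
One thing that may be worth trying: the inline asm constraint string "=r,0,i" does not tell LLVM that `add` modifies the status flags, so a scheduler is free to place the asm between a compare and its branch. LLVM IR expresses such a side effect with a `~{flags}` clobber in the constraint string. The sketch below assumes that this missing clobber is the cause, which the thread has not confirmed; the helper `add_flags_clobber` is a made-up name and only rewrites the IR text.

```python
import re

# Matches `asm "<template>", "<constraints>"` as written in the IR above.
# Assumption: no extra keywords (e.g. sideeffect) sit between `asm` and
# the template string; such calls are left untouched by this sketch.
ASM_CONSTRAINTS = re.compile(r'(asm\s+"[^"]*",\s*")([^"]*)(")')


def add_flags_clobber(ir):
    """Append ~{flags} to every matched inline asm constraint string."""
    def patch(match):
        constraints = match.group(2)
        if "~{flags}" in constraints:
            return match.group(0)  # already declared, leave as is
        return match.group(1) + constraints + ",~{flags}" + match.group(3)
    return ASM_CONSTRAINTS.sub(patch, ir)


line = '%"reg.0" = call i32 asm  "add $2, $0", "=r,0,i" (i32 %"in.0", i32 1)'
patched = add_flags_clobber(line)
print(patched)
```

Passing the full IR string from the reproducer through `add_flags_clobber` before `llvm.parse_assembly` would show whether the znver1 build then returns the expected count.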