bad RVV code generation with `-fvectorize`

Open compnerd opened this issue 1 year ago • 4 comments

Under qemu, this test case appears to run roughly 100x slower when vectorized than when compiled as scalar code.

Reduced test case from XNNPACK:

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream> 
 
bool z{true};
size_t a{1};
uint32_t b{1};
uint32_t c{1}; 
 
inline size_t select() {
  if (z) return (a + b - 1) / c;
  return (a - b + 1) / c + 1;
} 
 
void __attribute__((__noinline__)) test() {
  int8_t input[100];
  int8_t max_value = 0; 
 
  for (size_t c = 0; c < 3000; c++) {
    for (size_t py = 0; py < 15; py++)
      max_value = std::max(max_value, input[select()]);
    __asm__ __volatile__("" ::"r"(max_value));
  }
} 
 
int main(int argc, const char* argv[]) {
  auto start = std::chrono::system_clock::now();
  for (int i = 0; i < 500; ++i)
    test();
  auto end = std::chrono::system_clock::now();
  std::cout << "Elapsed time: "
            << std::chrono::duration<double>(end - start).count() << "s\n";
  return 0;
}

CC: @preames

compnerd, Oct 14 '22 16:10

@llvm/issue-subscribers-backend-risc-v

llvmbot, Oct 14 '22 16:10

I'd taken a quick look at this before filing. The following is my first impression from a quick scan; details may be off.

I think what we're seeing is the interaction of several issues here. I'm going to list them; no idea of relative importance yet.

  1. We're somehow ending up with LMUL8 code here. We're supposed to be limited to LMUL1 at the moment, so this is odd. This causes horrible stack fill/spill problems.
  2. We speculated (via predication) both sides of a branch, each of which contained a udiv. This should have been commoned earlier, and probably hoisted out of the loop entirely. The fact that it hasn't been, even in the scalar loop, is quite odd. Really, this shouldn't even be a loop at all.
  3. The selects of i1 vectors have obvious simplifications which were missed. I don't understand this at all.
  4. The codegen for some of the selects appears bad. Not sure if this is a side effect of the regalloc behavior though.
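
To make item 2 concrete, here is a rough hand-written sketch of what hoisting would look like at the source level. It is not what the compiler emits, and it is not taken from the godbolt output. `select_index` and `reduce_hoisted` are hypothetical names introduced only for illustration; the arithmetic is copied from `select()` in the test case, with the globals passed as parameters. The assumption is that nothing in the loop body writes `z`, `a`, `b`, or `c`, so the index is loop-invariant and the repeated `std::max` over a single element collapses to one call.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: same arithmetic as select() in the test case,
// but with the globals passed in so the result can be computed once.
inline std::size_t select_index(bool z, std::size_t a, std::uint32_t b,
                                std::uint32_t c) {
  if (z) return (a + b - 1) / c;
  return (a - b + 1) / c + 1;
}

// Hand-hoisted equivalent of the inner `py` loop (illustrative only).
// The index does not depend on the loop counter, so the division is done
// once, and max(max_value, input[idx]) repeated 15 times gives the same
// result as a single call.
inline std::int8_t reduce_hoisted(const std::int8_t* input, bool z,
                                  std::size_t a, std::uint32_t b,
                                  std::uint32_t c) {
  const std::size_t idx = select_index(z, a, b, c);  // one udiv, outside any loop
  std::int8_t max_value = 0;
  return std::max(max_value, input[idx]);
}
```

With the initial values from the test case (`z = true`, `a = b = c = 1`), the index evaluates to 1, so the whole inner loop reduces to a single comparison against `input[1]`.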

I will note that for my purposes I'm treating this purely as a source of interesting code gen examples. The relative performance on qemu is out of any scope I care about, and I will not be addressing that point at all. I've been told qemu has limited support for jitting vector code, and figuring out if that is relevant is well out of scope for an llvm bug.

preames, Oct 14 '22 16:10

> I will note that for my purposes I'm treating this purely as a source of interesting code gen examples.

Yes, it is meant to be an example of interesting code generation. The qemu performance was just a means of noticing that something interesting was going on. This code was reduced pretty heavily to make it easier to reason about.

compnerd, Oct 14 '22 16:10

https://godbolt.org/z/41W8qeeoM

preames, Oct 14 '22 17:10