
LLVM/Clang got super slow when it got compiled with Polly

Open ms178 opened this issue 2 years ago • 11 comments

For about a week or so, I've noticed a huge slowdown in the compilation speed of LLVM/Clang when it is compiled with the following flags (my CPU is a Haswell):

export CC=clang
export CXX=clang++
export CC_LD=lld
export CXX_LD=lld
export AR=llvm-ar
export NM=llvm-nm
export STRIP=llvm-strip
export OBJCOPY=llvm-objcopy
export OBJDUMP=llvm-objdump
export READELF=llvm-readelf
export RANLIB=llvm-ranlib
export HOSTCC=clang
export HOSTCXX=clang++
export HOSTAR=llvm-ar
export CFLAGS="-O3 -march=native -mllvm -polly -mllvm -polly-position=early -mllvm -polly-parallel=true -fopenmp -fopenmp-version=50 -mllvm -polly-dependences-computeout=0 -mllvm -polly-detect-profitability-min-per-loop-insts=40 -mllvm -polly-tiling=true -mllvm -polly-prevect-width=256 -mllvm -polly-vectorizer=stripmine -mllvm -polly-omp-backend=LLVM -mllvm -polly-num-threads=36 -mllvm -polly-scheduling=dynamic -mllvm -polly-scheduling-chunksize=1 -mllvm -polly-ast-use-context -mllvm -polly-invariant-load-hoisting -mllvm -polly-loopfusion-greedy -mllvm -polly-run-inliner -mllvm -polly-run-dce -mllvm -polly-enable-delicm=true -mllvm -extra-vectorizer-passes -mllvm -enable-cond-stores-vec -mllvm -slp-vectorize-hor-store -mllvm -enable-loopinterchange -mllvm -enable-loop-distribute -mllvm -enable-unroll-and-jam -mllvm -enable-loop-flatten -mllvm -interleave-small-loop-scalar-reduction -mllvm -unroll-runtime-multi-exit -mllvm -aggressive-ext-opt -fno-math-errno -fno-trapping-math -falign-functions=32 -fno-semantic-interposition -fcf-protection=none -mharden-sls=none -fomit-frame-pointer -flto"
export CXXFLAGS="${CFLAGS}"
export LDFLAGS="-Wl,--lto-O3,-O3,-Bsymbolic-functions,--as-needed -Wl,-mllvm,-march=native -mllvm -extra-vectorizer-passes -mllvm -enable-cond-stores-vec -mllvm -slp-vectorize-hor-store -mllvm -enable-loopinterchange -mllvm -enable-loop-distribute -mllvm -enable-unroll-and-jam -mllvm -enable-loop-flatten -mllvm -interleave-small-loop-scalar-reduction -mllvm -unroll-runtime-multi-exit -mllvm -aggressive-ext-opt -flto -fuse-ld=lld"
export ASFLAGS="-D__AVX__=1 -D__AVX2__=1 -msse2avx -D__FMA__=1"

Here is my build configuration: PKGBUILD.txt [edit: sorry, I had uploaded the wrong file at first]

If I delete all Polly-related flags in CFLAGS, the compilation speed of the produced binaries goes back to normal. Using FullLTO is intentional here, but the slowdown was also reported to me when using ThinLTO. The mentioned Polly flags used to work just fine before.
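
For reference, "deleting all Polly-related flags" means dropping every -mllvm -polly* pair from the CFLAGS line above; the remaining (fast) configuration is roughly this sketch (the exact subset kept is my reading of the line above):

export CFLAGS="-O3 -march=native -fopenmp -fopenmp-version=50 -mllvm -extra-vectorizer-passes -mllvm -enable-cond-stores-vec -mllvm -slp-vectorize-hor-store -mllvm -enable-loopinterchange -mllvm -enable-loop-distribute -mllvm -enable-unroll-and-jam -mllvm -enable-loop-flatten -mllvm -interleave-small-loop-scalar-reduction -mllvm -unroll-runtime-multi-exit -mllvm -aggressive-ext-opt -fno-math-errno -fno-trapping-math -falign-functions=32 -fno-semantic-interposition -fcf-protection=none -mharden-sls=none -fomit-frame-pointer -flto"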

ms178 avatar Aug 11 '22 10:08 ms178

For about a week or so, I've noticed a huge slowdown in compilation speed of LLVM/Clang when

Could you confirm this is a recent regression on main? Would it be possible for you to try to bisect back to the commit that introduced the regression? Also, what Clang version are you using to build LLVM?

fhahn avatar Aug 11 '22 10:08 fhahn

Yes, sorry that I cannot be too specific on this, but it should be a recent regression on main, as I compile LLVM/Clang with a former main snapshot. I can re-test with current main to see if it is still reproducible, but a friend tested my Polly flags yesterday and reported back to me that the issue is still there.

As for bisecting: since my PKGBUILD targets llvm-git automatically, I am sorry, I lack the knowledge to edit it to check out a specific former commit.

ms178 avatar Aug 11 '22 10:08 ms178

As for bisecting: since my PKGBUILD targets llvm-git automatically, I am sorry, I lack the knowledge to edit it to check out a specific former commit.

I think what you would have to do is clone https://github.com/llvm/llvm-project.git, then create a build directory and run the commands from the build() step in the PKGBUILD.
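
Roughly like this, as an untested sketch (the cmake options are assumptions; take the real ones from the build() step, and use a known-good and a known-bad revision as the bisect endpoints):

git clone https://github.com/llvm/llvm-project.git
cd llvm-project
git bisect start <known-bad-commit> <known-good-commit>
mkdir build && cd build
cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_PROJECTS="clang;lld;polly" ../llvm
ninja
# test the just-built clang for the slowdown, then mark the revision:
git -C .. bisect good   # or: git -C .. bisect bad
# rebuild and repeat until git bisect reports the first bad commit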

Any help with reducing it to a commit that introduced the regression would be really helpful to resolve the issue, as I think it is a quite uncommon build configuration and it might take a substantial amount of time for someone to take a look.

fhahn avatar Aug 11 '22 10:08 fhahn

I've now reproduced the issue using clang version 16.0.0 (2cb51449f0d9ed06de87b4a47b5074eb6eec2e23) as the base compiler with ThinLTO and the rest of the flags from above. Building LLVM/Clang (14913fa5d0508c3e29ae69552ef0710c82913a75) that way took around 20 minutes. But even though ccache is used, the produced LLVM/Clang is significantly slower: it got through only 1200 of 4272 files in 11 minutes. Monitoring via htop shows that very little RAM is used (only around 3.4 - 4.1 GB) while the CPU load average is significantly higher (about 35 - 80 during the first build versus a constant 220 - 235 during the second build).

ms178 avatar Aug 11 '22 11:08 ms178

@llvm/issue-subscribers-polly

llvmbot avatar Aug 11 '22 13:08 llvmbot

Can you identify one file that takes long to compile? For that file, add -polly-dump-before, which writes a file with a -before.ll suffix into the current directory. That file would help identify what the issue is.
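
For example (SlowFile.cpp is just a placeholder; keep the rest of your flags unchanged):

clang++ -O3 -march=native -mllvm -polly -mllvm -polly-position=early -mllvm -polly-dump-before -c SlowFile.cpp
# should produce a *-before.ll file in the current directory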

Meinersbur avatar Aug 12 '22 19:08 Meinersbur

I'll try to provide such a file later this week. Just for clarification: it is not just the resulting LLVM/Clang compiler that is slowed down; even the CMake checks take noticeably longer when using that compiler. Hence it can be noticed almost immediately at the end of the build process, as the OpenMP runtime gets built with the just-built LLVM/Clang, which is noticeably slower than without the Polly flags. Building any project with that "poisoned" compiler is slower than before; compiling LLVM/Clang itself is the best showcase, though. I have also seen this when trying to build Mesa or the kernel with that poisoned compiler.

ms178 avatar Aug 12 '22 22:08 ms178

The interesting part is that this behavior only shows up when LLVM/Clang itself got compiled with the Polly flags from above. I can still use these Polly flags without problems on other projects (e.g. Mesa) and haven't noticed anything wrong (nor any performance anomalies) with the produced binaries, as long as the compiler itself was not built with the mentioned Polly flags.

ms178 avatar Aug 12 '22 22:08 ms178

-mllvm -polly-dependences-computeout=0 -mllvm -polly-detect-profitability-min-per-loop-insts=40

These are options that will make compilation slow. The former removes the bailout that stops Polly from spending a lot of time on solving equation systems. The latter enables the loop nest optimizer even for non-nested loops. These options make compilation slow by design.
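
If compile time matters, a sketch of dialing these back in the shell before building (500000 is the bound that appears in the last comment of this thread; whether it matches the built-in default is not verified here):

# restore a finite bailout instead of disabling it with =0
CFLAGS="${CFLAGS/-polly-dependences-computeout=0/-polly-dependences-computeout=500000}"
# drop the lowered profitability threshold so single loops are skipped again
CFLAGS="${CFLAGS/ -mllvm -polly-detect-profitability-min-per-loop-insts=40/}"
export CFLAGS CXXFLAGS="${CFLAGS}"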

Meinersbur avatar Aug 16 '22 15:08 Meinersbur

@Meinersbur Yes, I know - I did my research on these options. Maybe I wasn't able to articulate the problem clearly enough, but the compilation speed with these options on a "good" LLVM/Clang is not problematic at all. The slowness only shows up after using these flags to build LLVM/Clang itself and then using the just-built poisoned compiler for everything else, regardless of the flags used at that point. That poisoned compiler is significantly slower at everything, not just when using the Polly flags you mentioned. For example, a Mesa FullLTO build now takes over an hour where it used to take 20 - 25 minutes with the same set of flags. Another example is the full-LTO Linux kernel build, which now also took over an hour to complete.

And to be clear: using the Polly flags on an LLVM/Clang build used to produce a reasonably well-working LLVM/Clang some weeks ago, and now I get an LLVM/Clang that runs at a crawl. If it helps, I can upload such a poisoned LLVM/Clang for further analysis.

ms178 avatar Aug 16 '22 16:08 ms178

It sounds like some specific pass is getting more expensive due to runtime checks, or something like that. You're going to have to figure out what that pass is. Maybe -ftime-report would show something.
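
A minimal sketch, run once with each toolchain on the same source file (SomeSlowFile.cpp is a placeholder; both reports go to stderr):

# frontend/backend timing summary
clang++ -O3 -ftime-report -c SomeSlowFile.cpp 2> timings.txt
# per-pass timers from LLVM itself
clang++ -O3 -mllvm -time-passes -c SomeSlowFile.cpp 2>> timings.txt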

efriedma-quic avatar Aug 16 '22 17:08 efriedma-quic

@Meinersbur @efriedma-quic

Here are two CMakeOutput logs from compiling sddm with LLVM builds from today, using -ftime-report. With the normally behaving toolchain it took 18 seconds to complete; with the poisoned toolchain it took more than twice as long, 42 seconds. Detailed timings for the passes are in the logs; the same set of flags was used for compiling sddm in both cases and is also specified in the logs.

CMakeOutput_NORMAL.log (llvm-git-16.0.0_r433204.164266739298-1-x86_64) CMakeOutput_POISENED.log (llvm-git-16.0.0_r433291.9ad0ace2ba52-1-x86_64_POISENED)
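
Assuming the logs contain LLVM's standard timer sections, the summaries can be pulled out for a quick side-by-side look:

grep -n "Total Execution Time" CMakeOutput_NORMAL.log CMakeOutput_POISENED.log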

ms178 avatar Aug 18 '22 13:08 ms178

Here are several files from the /src/build/test directory, made with the poisoned toolchain and -polly-dump-before - I hope it helps: ConfigReader-before.txt ConfigurationTest-before.txt mocs_compilation-before.txt

ms178 avatar Aug 18 '22 13:08 ms178

For further analysis, here is a link to download the binaries of the poisoned toolchain: https://1drv.ms/u/s!Agwwh-axGk6DglWllL3tX7Nltw8h?e=UAj8Rx

Note: A Haswell or compatible CPU is required
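
Since the toolchain was built with -march=native on a Haswell, the host needs at least Haswell's ISA extensions (AVX2, BMI2, FMA; this list is inferred from the flags above, not from the binaries). A quick check:

grep -o -E 'avx2|bmi2|fma' /proc/cpuinfo | sort -u
# all three should be printed; otherwise the binaries will likely die with SIGILL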

ms178 avatar Aug 18 '22 13:08 ms178

I can't reproduce this with a recent LLVM 17 build and a slightly different set of CFLAGS (without polly-parallel).

export CFLAGS="-O3 -march=native -mtune=native -maes -mllvm -inline-threshold=1000 -mllvm -polly -mllvm -polly-position=early -mllvm -polly-dependences-computeout=500000 -mllvm -polly-detect-profitability-min-per-loop-insts=40 -mllvm -polly-tiling=true -mllvm -polly-prevect-width=256 -mllvm -polly-vectorizer=stripmine -mllvm -polly-scheduling=dynamic -mllvm -polly-scheduling-chunksize=1 -mllvm -polly-invariant-load-hoisting -mllvm -polly-loopfusion-greedy -mllvm -polly-run-inliner -mllvm -polly-run-dce -mllvm -polly-enable-delicm=true -mllvm -extra-vectorizer-passes -mllvm -enable-cond-stores-vec -mllvm -slp-vectorize-hor-store -mllvm -enable-loopinterchange -mllvm -enable-loop-distribute -mllvm -enable-unroll-and-jam -mllvm -enable-loop-flatten -mllvm -interleave-small-loop-scalar-reduction -mllvm -unroll-runtime-multi-exit -mllvm -aggressive-ext-opt -mllvm -enable-interleaved-mem-accesses -mllvm -enable-masked-interleaved-mem-accesses -fno-math-errno -fno-trapping-math -falign-functions=32 -funroll-loops -fno-semantic-interposition -fcf-protection=none -mharden-sls=none -fomit-frame-pointer -mprefer-vector-width=256 -flto"
export CXXFLAGS="${CFLAGS}"
export LDFLAGS="-Wl,-O3,-Bsymbolic-functions,--as-needed -march=native -mllvm -extra-vectorizer-passes -mllvm -enable-cond-stores-vec -mllvm -slp-vectorize-hor-store -mllvm -enable-loopinterchange -mllvm -enable-loop-distribute -mllvm -enable-unroll-and-jam -mllvm -enable-loop-flatten -mllvm -interleave-small-loop-scalar-reduction -mllvm -unroll-runtime-multi-exit -mllvm -aggressive-ext-opt -mllvm -enable-interleaved-mem-accesses -mllvm -enable-masked-interleaved-mem-accesses -maes -flto -fuse-ld=mold -Wl,-zmax-page-size=0x200000"

ms178 avatar Apr 08 '23 22:04 ms178