popcorn-compiler Soft floating-point emulation routines leaking into aarch64

GCC soft float library routines are leaking into musl, which can cause unhandled level 3 translation faults in programs. E.g., NPB's EP:

Feb 20 09:05:11 fox6 kernel: [66009.923530] EP_A[3919]: unhandled level 3 translation fault (11) at 0x00522330, esr 0x83000007
Feb 20 09:05:11 fox6 kernel: [66009.923538] pgd = ffff810f42255000
Feb 20 09:05:11 fox6 kernel: [66009.927041] [00522330] *pgd=0000010f13e68003, *pud=0000010f1c041003, *pmd=0000010f2017a003, *pte=0000000000000000
Feb 20 09:05:11 fox6 kernel: [66009.937485]
Feb 20 09:05:11 fox6 kernel: [66009.937494] CPU: 7 PID: 3919 Comm: EP_A Tainted: G           OE   4.4.55+ #17
Feb 20 09:05:11 fox6 kernel: [66009.937497] Hardware name: Penguin Computing Valkre 2000/MT60-SC4, BIOS T48 10/02/2017
Feb 20 09:05:11 fox6 kernel: [66009.937501] task: ffff810f13e51a00 ti: ffff810f1bdac000 task.ti: ffff810f1bdac000
Feb 20 09:05:11 fox6 kernel: [66009.937506] PC is at 0x522330
Feb 20 09:05:11 fox6 kernel: [66009.937509] LR is at 0x508004
Feb 20 09:05:11 fox6 kernel: [66009.937512] pc : [<0000000000522330>] lr : [<0000000000508004>] pstate: a0000000
Feb 20 09:05:11 fox6 kernel: [66009.937515] sp : 00007fffffbfc8c0
Feb 20 09:05:11 fox6 kernel: [66009.937518] x29: 00007fffffbfc8d0 x28: 0000000000000000
Feb 20 09:05:11 fox6 kernel: [66009.937522] x27: 0000000000601061 x26: 0000000000000000
Feb 20 09:05:11 fox6 kernel: [66009.937527] x25: 00007fffffbfea80 x24: 0000000000601066
Feb 20 09:05:11 fox6 kernel: [66009.937532] x23: 000000007fffffff x22: 0000000000000000
Feb 20 09:05:11 fox6 kernel: [66009.937536] x21: 000000000000000f x20: 0000000000000001
Feb 20 09:05:11 fox6 kernel: [66009.937541] x19: 00007fffffbfcb30 x18: 00007fffffbfcaf0
Feb 20 09:05:11 fox6 kernel: [66009.937545] x17: 00007fffffbfcb5f x16: 0000000000604000
Feb 20 09:05:11 fox6 kernel: [66009.937550] x15: 000000000000000f x14: 0000000000601320
Feb 20 09:05:11 fox6 kernel: [66009.937554] x13: 000000000000003a x12: 000000007fffffff
Feb 20 09:05:11 fox6 kernel: [66009.937559] x11: 000000000000000a x10: 00000000ffffff90
Feb 20 09:05:11 fox6 kernel: [66009.937563] x9 : 00007fffffbfec70 x8 : 00007fffffbfebf0
Feb 20 09:05:11 fox6 kernel: [66009.937568] x7 : 0000000000000001 x6 : 000000000060102f
Feb 20 09:05:11 fox6 kernel: [66009.937572] x5 : 000000000060102f x4 : 0000000000000001
Feb 20 09:05:11 fox6 kernel: [66009.937577] x3 : 0000000000012889 x2 : 00007fffffbfe9e0
Feb 20 09:05:11 fox6 kernel: [66009.937581] x1 : 0000000000000019 x0 : 00007fffffbfcb30

Running objdump shows:

0000000000522330 <__extenddftf2>:
  522330:       9e660000        fmov    x0, d0
  522334:       d53b4401        mrs     x1, fpcr
  522338:       d374f801        ubfx

As for where this gets called -- one call-chain (there may be others) is through pop_arg in musl-1.18/src/stdio/vfprintf.c at line 128:

108    static void pop_arg(union arg *arg, int type, va_list *ap)
109    {
110    	switch (type) {
111    	       case PTR:	arg->p = va_arg(*ap, void *);
112    	break; case INT:	arg->i = va_arg(*ap, int);
113    	break; case UINT:	arg->i = va_arg(*ap, unsigned int);
114    	break; case LONG:	arg->i = va_arg(*ap, long);
115    	break; case ULONG:	arg->i = va_arg(*ap, unsigned long);
116    	break; case ULLONG:	arg->i = va_arg(*ap, unsigned long long);
117    	break; case SHORT:	arg->i = (short)va_arg(*ap, int);
118    	break; case USHORT:	arg->i = (unsigned short)va_arg(*ap, int);
119    	break; case CHAR:	arg->i = (signed char)va_arg(*ap, int);
120    	break; case UCHAR:	arg->i = (unsigned char)va_arg(*ap, int);
121    	break; case LLONG:	arg->i = va_arg(*ap, long long);
122    	break; case SIZET:	arg->i = va_arg(*ap, size_t);
123    	break; case IMAX:	arg->i = va_arg(*ap, intmax_t);
124    	break; case UMAX:	arg->i = va_arg(*ap, uintmax_t);
125    	break; case PDIFF:	arg->i = va_arg(*ap, ptrdiff_t);
126    	break; case UIPTR:	arg->i = (uintptr_t)va_arg(*ap, void *);
127    	break; case DBL:	arg->f = va_arg(*ap, double);
128    	break; case LDBL:	arg->f = va_arg(*ap, long double);
129    	}
130    }

This line appears to get executed whenever printf (or related functions) gets called with the %lf conversion specifier (based on a gdb trace I performed). I tried removing the long modifier (l), but that didn't seem to change the behavior.

I've been investigating how to address this -- ideally, we could eliminate quad precision (128-bit floating point) types and force long doubles to be the same size as doubles (which is allowed by the C standard). That said, I have no idea how to achieve that as of right now.

Feb 20 '18 15:02 acarno

This patch should fix the issue (at least temporarly):

diff --git a/tool/alignment/pyalign/Linker.py b/tool/alignment/pyalign/Linker.py
index 710d6d3..64494b9 100644
--- a/tool/alignment/pyalign/Linker.py
+++ b/tool/alignment/pyalign/Linker.py
@@ -90,6 +90,14 @@ class Linker:
                for symbolName in Globals.SYMBOLS_BLACKLIST[section]:
                        output_buffer.append("\t*(" + symbolName + ");\n")

+             # *This solve the problem of compiler inserted function: float function such as __extenddftf2
+             # (these function lands in the default *text section at the end of other function specific fuunctions)
+             # This leads to section with different size on Xeon and ARM(bigger) binaries which in turn leads to
+             # a segfault, since the DSM check the address at origin and when not found, trigger the fault!
+             # *The solution is to make sure that both section (on ARM and Xeon) have the same size.
+             # This alignement sould do just that (at least for many cases!)
+             output_buffer.append("\t. = ALIGN(0x100000);\n")
+
              # Section "closing" part
              output_buffer.append("}\n")

Feb 20 '18 15:02 mohamed-karaoui

@mohamed-karaoui That did the trick. Maybe submit that as a patch via pull request? Thanks!!

Feb 20 '18 16:02 acarno

I know this patch fixes this particular problem for EP, but it's not very robust (plus having gigantic alignment directives all over the place balloons the size of the binary). @olivierpierre is there a way we can accumulate the architecture-specific & blacklisted symbol sizes in order to safely align the next section to the correct page across all architectures?

Feb 20 '18 16:02 rlyerly

I'll have to look at this, can we attach the sources for a test case here? In the meantime, I recommend using Mohamed solution, some ideas to make it lighter:

Potentially the amount of padding can be reduced imo, I doubt arm is adding 1MB of code at the end of the .text section
It seems this padding can be added only at the end of selected section (.text., any other?)

Feb 20 '18 21:02 olivierpierre

@olivierpierre Any program will do, but here's a simple hello world program for fun :)

More accurately, anything compiled with Popcorn's musl library will do, since that's where the soft float routines originate.

test.tar.gz

Feb 20 '18 21:02 acarno

I believe I've found a better (?) fix that eliminates the soft float libraries altogether. Based on a comment on a StackOverflow question I asked and a post on the llvm-dev listserv, I tracked down the following code in <LLVM_SRC>/tools/clang/lib/Basic/Targets.cpp (lines 4950-4991):

class AArch64TargetInfo : public TargetInfo {
  virtual void setDescriptionString() = 0;
  static const TargetInfo::GCCRegAlias GCCRegAliases[];
  static const char *const GCCRegNames[];

  enum FPUModeEnum {
    FPUMode,
    NeonMode
  };

  unsigned FPU;
  unsigned CRC;
  unsigned Crypto;

  static const Builtin::Info BuiltinInfo[];

  std::string ABI;

public:
  AArch64TargetInfo(const llvm::Triple &Triple)
      : TargetInfo(Triple), ABI("aapcs") {

    if (getTriple().getOS() == llvm::Triple::NetBSD) {
      WCharType = SignedInt;

      // NetBSD apparently prefers consistency across ARM targets to consistency
      // across 64-bit targets.
      Int64Type = SignedLongLong;
      IntMaxType = SignedLongLong;
    } else {
      WCharType = UnsignedInt;
      Int64Type = SignedLong;
      IntMaxType = SignedLong;
    }

    LongWidth = LongAlign = PointerWidth = PointerAlign = 64;
    MaxVectorAlign = 128;
    MaxAtomicInlineWidth = 128;
    MaxAtomicPromoteWidth = 128;

    LongDoubleWidth = LongDoubleAlign = SuitableAlign = 128;
    LongDoubleFormat = &llvm::APFloat::IEEEquad;

By changing the last two lines here, I was able to force the long double type to 64 bits (the same size as a double) on ARM64:

    LongDoubleWidth = LongDoubleAlign = SuitableAlign = 64;
    LongDoubleFormat = &llvm::APFloat::IEEEdouble;

This removed all reference to __extenddftf2 and (it appears) all other soft float library routines.

So far, I've tested this by:

Successfully rebuilding the compiler
Successfully rebuilding the Popcorn libraries (musl, stack transformation, and migration)
Successfully building an NPB binary (CG, class A, make-popcorn-explicit branch)
Successfully running & migrating the binary

I believe further testing is probably needed (I'll be working with this modified version of the compiler for a while to see if any bugs pop up), but as of right now it's looking promising.

Mar 06 '18 23:03 acarno

This is awesome, thanks! Once you verify that it works, you can submit a pull request or let me know if you want me to implement. Just a couple of mental notes here that I don't want to forget:

We should make this command-line selectable if possible, so that we can enable automatically with "-popcorn-*" but not force users to use it if they don't want. We also need to document that this is enabled with the Popcorn flags
This fix needs to be applied to all architectures, including PowerPC & X86 (& maybe RISCV if we go that direction)

Mar 07 '18 16:03 rlyerly

@rlyerly I'm going to test it for another day or two, but I'll try to put together a pull request for at least the ARM changes.

Mar 07 '18 16:03 acarno

Additional note: requires modifying arch/aarch64/bits/float.h in musl (replace every field except DECIMAL_DIG with values from arch/arm/bits/float.h.

Apr 10 '18 23:04 acarno

popcorn-compiler popcorn-compiler copied to clipboard

Soft floating-point emulation routines leaking into aarch64

popcorn-compiler
popcorn-compiler copied to clipboard