blink
ppc64le JIT
This isn't a request to write it; I can write this. The question is whether it would be accepted (the comments in jit.c seem to imply the desire is to keep it x86_64 and aarch64 only; perhaps I've read them wrong).
I would love to see Blink be able to JIT ppc64le. Especially if being able to do so doesn't increase the binary footprint of our x86-64 and aarch64 builds. If you can help us do that, then please join our Discord and have fun hacking with us! https://discord.gg/vFdkMdQN
I hope ppc64le support will be
This issue hasn't been updated in a while, so I intend to make an announcement.
I will implement JIT support for the IBM OpenPOWER architecture if someone donates to me either the Talos™ II 2U Rack Mount Server or Talos™ II Desktop Development System. The rack mounted one might be better, since it has 36 cores and would therefore let me compile code faster for all our users. It would cost $10,669.99 and I could deliver top-notch x86_64 JITing for POWER users in less than a month, made freely available under an ISC license.
I've wrestled off and on with this for a while and I'm blocked on a crash I can't resolve. If I may gently prod, the code that needs to be updated to bring up a new JIT is spread across multiple places that aren't always obviously marked, so I've probably missed something I don't know to fix. This is the current patch, with several things commented out that don't work yet; if I read it right, they should only affect the quality of the generated code, not its functionality.
With gdb --args o//blink/blink -es build/bootstrap/mkdeps.com it ends up bombing out in OpStos with evidence of stack corruption (can't unwind past ExecuteInstruction) after executing for a while, so this is tough to debug. I tried to pattern it after aarch64, but there were some places that generated ARM64 code which weren't clearly doing so, ahem.
If you wouldn't mind having a look at the patch, what have I missed? It codegens fine and starts execution, so the rudiments work. It is limited to ppc64le but it may work fine on big-endian ppc64 when this is done.
I did notice that I got IsRet wrong and fixed that to be more like ARM, but that isn't the problem here. The current crash looks like this:
% gdb --args o//blink/blink -es build/bootstrap/mkdeps.com
GNU gdb (GDB) Fedora Linux 13.1-4.fc38
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "ppc64le-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from o//blink/blink...
(gdb) run
Starting program: /home/spectre/src/blink/o/blink/blink -es build/bootstrap/mkdeps.com
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
I2023-06-01T10:40:25.265590:blink/loader.c:708:816742 (sys) LoadProgram build/bootstrap/mkdeps.com
I2023-06-01T10:40:25.265771:blink/loader.c:100:816742 (sys) PT_LOAD R.X [400000,42e000) build/bootstrap/mkdeps.com
I2023-06-01T10:40:25.265854:blink/loader.c:100:816742 (sys) PT_LOAD RW. [42e000,456000) build/bootstrap/mkdeps.com
FuseBranchCmp
FuseBranchTest
FuseBranchTest
FuseBranchTest
FuseBranchCmp
FuseBranchTest
FuseBranchTest
FuseBranchTest
FuseBranchCmp
FuseBranchCmp
FuseBranchTest
FuseBranchTest
FuseBranchTest
Program received signal SIGSEGV, Segmentation fault.
0x0000000100048fc0 in StringOp (m=0x101ff7230, rde=297425592599457792, disp=0,
uimm0=0, op=op@entry=2) at blink/string.c:145
145 switch (op) {
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.37-4.fc38.ppc64le zlib-1.2.13-3.fc38.ppc64le
(gdb) bt
#0 0x0000000100048fc0 in StringOp (m=0x101ff7230, rde=297425592599457792,
disp=0, uimm0=0, op=op@entry=2) at blink/string.c:145
#1 0x000000010004987c in OpStos (m=<optimized out>, rde=<optimized out>,
disp=<optimized out>, uimm0=<optimized out>) at blink/string.c:301
#2 0x00000001000de008 in g_code ()
#3 0x0000000100029b00 in ExecuteInstruction (m=0x101ff7230)
at blink/machine.c:2205
#4 ExecuteInstruction (m=0x101ff7230) at blink/machine.c:2194
#5 0x00004fffffffeb18 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
And the disassembly going up to OpStos. This looks pretty normal, so the foul probably occurred earlier.
(gdb) disas 0x00000001000de008-0x40, 0x00000001000de008+0x40
Dump of assembler code from 0x1000ddfc8 to 0x1000de048:
0x00000001000ddfc8 <g_code+171544>: lis r6,0
0x00000001000ddfcc <g_code+171548>: ori r6,r6,0
0x00000001000ddfd0 <g_code+171552>: sldi r6,r6,32
0x00000001000ddfd4 <g_code+171556>: oris r6,r6,0
0x00000001000ddfd8 <g_code+171560>: ori r6,r6,0
0x00000001000ddfdc <g_code+171564>: lis r5,0
0x00000001000ddfe0 <g_code+171568>: ori r5,r5,0
0x00000001000ddfe4 <g_code+171572>: sldi r5,r5,32
0x00000001000ddfe8 <g_code+171576>: oris r5,r5,0
0x00000001000ddfec <g_code+171580>: ori r5,r5,0
0x00000001000ddff0 <g_code+171584>: lis r4,1056
0x00000001000ddff4 <g_code+171588>: ori r4,r4,43776
0x00000001000ddff8 <g_code+171592>: sldi r4,r4,32
0x00000001000ddffc <g_code+171596>: oris r4,r4,10752
0x00000001000de000 <g_code+171600>: ori r4,r4,12288
0x00000001000de004 <g_code+171604>: bl 0x100049858 <OpStos>
0x00000001000de008 <g_code+171608>: lis r5,0
0x00000001000de00c <g_code+171612>: ori r5,r5,0
0x00000001000de010 <g_code+171616>: sldi r5,r5,32
0x00000001000de014 <g_code+171620>: oris r5,r5,0
0x00000001000de018 <g_code+171624>: ori r5,r5,1638
0x00000001000de01c <g_code+171628>: lis r6,0
0x00000001000de020 <g_code+171632>: ori r6,r6,0
0x00000001000de024 <g_code+171636>: sldi r6,r6,32
0x00000001000de028 <g_code+171640>: oris r6,r6,0
0x00000001000de02c <g_code+171644>: ori r6,r6,1638
0x00000001000de030 <g_code+171648>: lis r7,0
0x00000001000de034 <g_code+171652>: ori r7,r7,0
0x00000001000de038 <g_code+171656>: sldi r7,r7,32
0x00000001000de03c <g_code+171660>: oris r7,r7,0
0x00000001000de040 <g_code+171664>: ori r7,r7,1638
0x00000001000de044 <g_code+171668>: lis r8,0
End of assembler dump.
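The five-instruction pattern in the dump (lis, ori, sldi, oris, ori) builds a 64-bit constant out of four 16-bit pieces. As a quick sanity check on the constant loading, here is a minimal C sketch that folds the four immediates back together the way the instructions would. The function name is made up for illustration and is not part of Blink or the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from Blink): replays the ppc64 sequence
 *   lis rX,h3 ; ori rX,rX,h2 ; sldi rX,rX,32 ; oris rX,rX,h1 ; ori rX,rX,h0
 * and returns the value that ends up in rX. */
static uint64_t ReplayLoadImm64(uint16_t h3, uint16_t h2,
                                uint16_t h1, uint16_t h0) {
  /* lis sign-extends its 16-bit immediate, then shifts it left by 16 */
  uint64_t r = (uint64_t)(int64_t)(int16_t)h3 << 16;
  r |= h2;                 /* ori: OR into bits 0..15 */
  r <<= 32;                /* sldi 32: sign-extension bits fall off the top */
  r |= (uint64_t)h1 << 16; /* oris: OR into bits 16..31 */
  r |= h0;                 /* ori: OR into bits 0..15 */
  return r;
}
```

Feeding it the r4 immediates from the dump (1056, 43776, 10752, 12288) yields 297425592599457792, which matches the rde argument shown in the backtrace, so the argument setup before the call to OpStos appears to be doing what it should.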
This is very exciting news! I'll have time to respond in the next few days. We're also talking about your contribution on our Discord. https://discord.gg/HQNA9faw We'd love it if you joined us!
Hello @classilla,
I am not familiar with PowerPC assembly or debugging, but are you able to dump the state of the registers at the crash site?
Incidentally, I noticed that when I tried running o/test/asm/add.com under o/powerpc64le/blink/blink (with QEMU emulation), I would get
blink/jit.c:1919:14593 assertion failed: !(disp & 0x03) (0)
PC 12b8c9012f0c mov %rax,0x30(%rsp) 48 89 44 24 30 48 8d 05
Some further exploration suggests that this was caused by Blink trying to insert a jump from an OomJit() address to some other place.
Thank you!
Yeah, I can reproduce that. I'm trying to find where that's set off (again, is there some other section of the JIT that I've missed?).
Looks like the crash was caused by TOC getting stomped on certain calls. I was hoping to avoid setting r12 to the destination address on every call but this seems unavoidable. It gets further now.
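For context: under the ELFv2 ABI used on ppc64le, a function's global entry point derives its TOC pointer (r2) from r12, so a caller that reaches a function through its address is expected to have that address in r12. Below is a minimal C sketch of what emitting such a call could look like, going through CTR. All function names are hypothetical, and the actual patch may well do this differently (for instance, by loading r12 and still branching with bl).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { kR12 = 12 };

/* Hypothetical encoders for the instructions seen in the disassembly. */
static uint32_t Lis(int rt, uint16_t imm) { /* lis rt,imm (addis rt,0,imm) */
  return 15u << 26 | (uint32_t)rt << 21 | imm;
}
static uint32_t Ori(int r, uint16_t imm) { /* ori r,r,imm */
  return 24u << 26 | (uint32_t)r << 21 | (uint32_t)r << 16 | imm;
}
static uint32_t Oris(int r, uint16_t imm) { /* oris r,r,imm */
  return 25u << 26 | (uint32_t)r << 21 | (uint32_t)r << 16 | imm;
}
static uint32_t Sldi32(int r) { /* sldi r,r,32 (rldicr r,r,32,31) */
  return 0x780007c6u | (uint32_t)r << 21 | (uint32_t)r << 16;
}

/* Writes a seven-instruction call to `target` into `out`, leaving the
 * callee address in r12 as the ELFv2 global entry point expects.
 * Returns the number of 32-bit words written. */
static size_t EmitCallViaR12(uint32_t *out, uint64_t target) {
  size_t n = 0;
  assert(out != NULL);
  out[n++] = Lis(kR12, (uint16_t)(target >> 48));
  out[n++] = Ori(kR12, (uint16_t)(target >> 32));
  out[n++] = Sldi32(kR12);
  out[n++] = Oris(kR12, (uint16_t)(target >> 16));
  out[n++] = Ori(kR12, (uint16_t)target);
  out[n++] = 0x7d8903a6u; /* mtctr r12 */
  out[n++] = 0x4e800421u; /* bctrl */
  return n;
}
```

The cost is visible here: seven instructions per call instead of a single bl, which is why skipping the r12 load for callees that never touch the TOC is tempting.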
What should OomJit() point to? How did that ever work for aarch64?
Hello @classilla,
What should OomJit() point to? How did that ever work for aarch64?
See OomJit() in blink/jit.c.
OK, I think I know what is going on. When the AArch64 JITter (e.g.) finds that there is no more space in the JIT buffer,
- it will point jb->index to beyond the end of the buffer
- AppendJit() will refuse to write anything further to the JIT buffer
- FlushCod() (blink/path.c) will later discard the incompletely written-out code.
In such cases it is OK for the JIT location counter to be unaligned, since the code will be discarded anyway. So perhaps instead of
unassert(!(disp & 0x03));
you could just say
unassert(!(disp & 0x03) || jb->index > kJitBlockSize);
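To make the sequence above concrete, here is a toy C model of the "poison the index on overflow" idea. It is only a sketch under stated assumptions: every name with a Toy prefix is invented for illustration, the block size is deliberately tiny, and none of this is Blink's actual implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { kToyBlockSize = 8 }; /* tiny stand-in for kJitBlockSize */

struct ToyJitBlock {
  size_t index; /* next write offset; > kToyBlockSize means "out of memory" */
  unsigned char code[kToyBlockSize];
};

/* Mark the block as exhausted by pointing index past the end. */
static void ToyOom(struct ToyJitBlock *jb) {
  jb->index = kToyBlockSize + 1;
}

/* Append n bytes, or poison the block if they do not fit. */
static bool ToyAppend(struct ToyJitBlock *jb, const void *data, size_t n) {
  if (jb->index > kToyBlockSize || n > kToyBlockSize - jb->index) {
    ToyOom(jb);
    return false;
  }
  memcpy(jb->code + jb->index, data, n);
  jb->index += n;
  return true;
}

/* Alignment check that tolerates a poisoned block, as suggested above. */
static bool ToyDispOk(const struct ToyJitBlock *jb, long disp) {
  return !(disp & 0x03) || jb->index > kToyBlockSize;
}

/* Finishing a path only succeeds if nothing overflowed along the way. */
static bool ToyFinish(const struct ToyJitBlock *jb) {
  return jb->index <= kToyBlockSize;
}
```

In this model an unaligned displacement is rejected while the block is healthy but accepted once the block is poisoned, since ToyFinish() will report the path as unusable anyway.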
Thank you!
This is the current checkpoint. It is enough to execute o//blink/blink build/bootstrap/mkdeps.com and many of the tests (in particular cosmo/2/test_suite_md.com and cosmo/2/test_suite_mpi.com are indeed 5-6x faster), but other tests that should pass quickly seem to hang indefinitely. This includes o//blink/blink third_party/cosmo/2/palandprintf_test.com and o//blink/blink third_party/cosmo/2/cos_test.com. Is there something weird about floating point I haven't accounted for?
The code it generates now is pretty good, but for many of the micro-ops it seems unnecessary to load r12 since they don't reference the TOC, and it would be nice to eliminate it for those quick load/store/gimme-register functions which get called a lot. Maybe I can come up with a whitelist. Some of those functions are single instructions and seem ideal for inlining if that ever becomes a thing. About the only thing missing is the ability to fuse overflow checks, because we have to go to XER for that, not the regular condition register fields.
Is there a way to debug calls?
Hello @classilla,
Do you mean you want to step into a function that is being called? You can probably use GDB or LLDB's step and/or stepi commands for that (unless I am missing something).
Thank you!
No (I'm well aware of what those do, probably wouldn't have been able to write anything without them ;-). What I want is to instrument what x86 instructions map to what blocks of generated code so I can understand where the infinite loop is coming from. If this isn't easily possible, I may put this aside for a while again, since I don't have any further way to understand the tests that fail.
Hello @classilla,
What I want is to instrument what x86 instructions map to what blocks of generated code so I can understand where the infinite loop is coming from.
It might be helpful to dump m->ip (m normally goes into the register kJitSav0) to get an idea of which basic block in the guest code is being run.
(In non-JIT mode, m->ip - m->oplen should give precise guest %rip values, but as the README explains, JITted code may try not to update m->ip unless really necessary.)
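One way to turn that suggestion into something usable for hunting an infinite loop is to count how often each block-entry address is seen. The following is only a sketch: IpTrace, TraceIp and HottestIp are hypothetical names, and how the helper would get called from generated code (or from the dispatch loop) is left out entirely.

```c
#include <assert.h>
#include <stdint.h>

enum { kTraceSlots = 64 };

/* Hypothetical hit counter for guest block-entry addresses. */
struct IpTrace {
  int n;                           /* number of slots in use */
  uint64_t ip[kTraceSlots];        /* distinct guest addresses seen */
  unsigned long hits[kTraceSlots]; /* times each address was entered */
};

/* Record one block entry; the caller passes the guest ip (e.g. m->ip). */
static void TraceIp(struct IpTrace *t, uint64_t ip) {
  assert(t != 0);
  for (int i = 0; i < t->n; ++i) {
    if (t->ip[i] == ip) {
      ++t->hits[i];
      return;
    }
  }
  if (t->n < kTraceSlots) { /* silently drop new addresses once full */
    t->ip[t->n] = ip;
    t->hits[t->n] = 1;
    ++t->n;
  }
}

/* Returns the most frequently entered address, or 0 if nothing was seen. */
static uint64_t HottestIp(const struct IpTrace *t) {
  uint64_t best = 0;
  unsigned long most = 0;
  for (int i = 0; i < t->n; ++i) {
    if (t->hits[i] > most) {
      most = t->hits[i];
      best = t->ip[i];
    }
  }
  return best;
}
```

If a test hangs, the address with a runaway hit count points at the guest basic block worth disassembling first.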
Some of those functions are single instructions and seem ideal for inlining if that ever becomes a thing.
The x86-64 JITter does know how to inline the more "trivial" micro-ops into the JIT stream. See the implementation of CallMicroOp() in blink/uop.c. The AArch64 JITter does not do this yet, but I am working on implementing it (https://github.com/jart/blink/pull/145). You could probably do something similar.
Thank you!
I eventually started stepping through the code with blinkenlights -j third_party/cosmo/2/cos_test.com to see where it diverges from a non-JIT run. It ends up making three normal calls to dtoa but the fourth call is where it goes haywire.
The code gets to 00414d9c * mov %rax,%r14. On the non-JIT run, single stepping goes to 00414d9f mov %r13d,%esi (the next instruction), as expected, but on the JIT run a single step immediately jumps to 00415400 movl $1,-0x8c(%rbp).
That doesn't make any sense. Did I forget to convert a section of code in my patch?
Hello @classilla,
This is probably expected and OK. In JIT mode, "single stepping" will not really step through just the next instruction, but instead it will run through an entire translated basic block. If the guest state is correct by the time the guest reaches %rip = 0x415400 then there should be no problem.
Thank you!
That's going to be a problem, because the divergence occurs in that entire segment it flies through (the guest state is not correct at the end of the basic block).
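Since a JIT "step" covers a whole basic block, one way to narrow this down is to capture the guest state at the end of the block in both runs and report the first register that differs. Here is a minimal C sketch of that comparison. GuestSnapshot and FirstMismatch are hypothetical and do not mirror Blink's struct Machine; a real comparison would also need the flags and the x87/SSE state, which probably matters here given that the failing tests exercise floating point.

```c
#include <assert.h>
#include <stdint.h>

enum {
  kGuestRegs = 16,          /* x86-64 general purpose registers */
  kMismatchIp = kGuestRegs, /* returned when only %rip differs */
  kNoMismatch = -1
};

/* Hypothetical snapshot of guest state at a basic block boundary. */
struct GuestSnapshot {
  uint64_t ip;
  uint64_t reg[kGuestRegs];
};

/* Returns the index of the first differing register, kMismatchIp if
 * only the instruction pointer differs, or kNoMismatch if equal. */
static int FirstMismatch(const struct GuestSnapshot *a,
                         const struct GuestSnapshot *b) {
  assert(a != 0 && b != 0);
  for (int i = 0; i < kGuestRegs; ++i) {
    if (a->reg[i] != b->reg[i]) return i;
  }
  if (a->ip != b->ip) return kMismatchIp;
  return kNoMismatch;
}
```

Running the interpreter until it reaches the same %rip as the end of the JIT-compiled block and diffing the two snapshots would at least say which register goes wrong first, which in turn limits the micro-ops worth inspecting.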