Eliminate redundant refcounting in the JIT
Feature or enhancement
Proposal:
Thanks to Matt's work on borrowed LOAD_FAST, we can now eliminate reference counting trivially in the JIT.
Reference counting is expensive, Matt found that eliminating 90% of refcounts in LOAD_FAST meant a 2-3% speedup in the interpreter. So the speedup for the JIT should be quite a bit too.
The other problem is that reference counts block register allocation/TOS caching. As they force spills to the stack more often.
This issue has the potential to speedup up JIT benchmarks by several percent.
How to contribute.
Now that reference tracking in the JIT optimizer is in place, the first thing we need to do is to convert ops to make the decref an explicit uop.
For escaping decref ops (ie, ops that decref could run the GC), we would need to refactor them so their decrefs are eliminated via specialization of pops. For example: the following op (which is not an escaping op, but just purely for demonstration!):
macro(BINARY_OP_ADD_INT) =
_GUARD_TOS_INT + _GUARD_NOS_INT + unused/5 + _BINARY_OP_ADD_INT;
becomes
macro(BINARY_OP_ADD_INT) =
_GUARD_TOS_INT + _GUARD_NOS_INT + unused/5 + _BINARY_OP_ADD_INT + _POP_TOP_INT + _POP_TOP_INT;
Previously _BINARY_OP_ADD_INT's stack effect looked like this: (left, right -- res). The new version should look like (left, right -- res, left, right).
So for all the decref specializations, we would just need a _POP_X of their variants! This means no explosion of uop and their decref variants. We just specialize _POP_X to _POP_X_NO_DECREF in the JIT. Keeping things clean
These are open for contributors to take:
- [X] POP_TOP @Fidget-Spinner
- [X] STORE_FAST @Fidget-Spinner
- [x] _BINARY_OP_X_FLOAT @Fidget-Spinner
- [x] _BINARY_OP_X_INT @Fidget-Spinner
- [x] _BINARY_OP_ADD_UNICODE @corona10
- [x] _BINARY_OP_SUBSCR_LIST_INT @cocolato
- [x] _BINARY_OP_SUBSCR_TUPLE_INT @cocolato
- [x] _BINARY_OP_SUBSCR_STR_INT @savannahostrowski
- [ ] _BINARY_OP_SUBSCR_DICT @caje731
- [x] _COMPARE_OP_X @cocolato
- [x] _CALL_TYPE_1 @tomasr8
- [x] _CALL_STR_1 @Zheaoli
- [x] _CALL_TUPLE_1 @noamcohen97
- [x] _CALL_BUILTIN_O @AndPuQing
- [x] _CALL_LEN @corona10
- [ ] _CALL_ISINSTANCE @noamcohen97
- [x] _CALL_LIST_APPEND
- [x] _CALL_METHOD_DESCRIPTOR_O @corona10
- [x] _STORE_SUBSCR_DICT @corona10
- [x] _STORE_SUBSCR_LIST_INT @corona10
- [x] _LOAD_ATTR_INSTANCE_VALUE @Zheaoli
- [x] _LOAD_ATTR_WITH_HINT @cocolato
- [x] _LOAD_ATTR_SLOT @Zheaoli
- [x] _STORE_ATTR_INSTANCE_VALUE @Zheaoli
- [x] _STORE_ATTR_SLOT @savannahostrowski
- [x] _STORE_ATTR_WITH_HINT @Zheaoli
- [x] IS_OP @cocolato
- [ ] TO_BOOL_ALWAYS_TRUE @reidenong
- [ ] TO_BOOL_INT @reidenong
- [ ] TO_BOOL_LIST @reidenong
- [ ] TO_BOOL_STR @Zheaoli
Once that's done, we can think about further ops.
Has this already been discussed elsewhere?
No response given
Links to previous discussion of this feature:
No response
Linked PRs
- gh-134588
- gh-135761
- gh-135817
- gh-135818
- gh-135860
- gh-135907
- gh-136056
- gh-136070
- gh-136077
- gh-136104
- gh-142576
- gh-142604
- gh-142620
- gh-142695
- gh-142703
- gh-142711
- gh-142712
- gh-142729
- gh-142759
- gh-142765
- gh-142767
- gh-142769
- gh-142825
- gh-142844
- gh-142921
- gh-142926
- gh-142949
- gh-142952
- gh-143062
- gh-143094
- gh-143107
- gh-143171
- gh-143186
- gh-143320
- gh-143330
- gh-143333
- gh-143336
- gh-143417
- gh-143427
@tomasr8 @brandtbucher this is gonna be a pretty sweet optimization :) Before:
//
// _BINARY_OP_ADD_INT.o: file format elf64-x86-64
//
// Disassembly of section .text:
//
// 0000000000000000 <_JIT_ENTRY>:
// 0: 55 pushq %rbp
// 1: 4d 8b 7d f0 movq -0x10(%r13), %r15
// 5: 49 8b 6d f8 movq -0x8(%r13), %rbp
// 9: 4c 89 ff movq %r15, %rdi
// c: 48 83 e7 fe andq $-0x2, %rdi
// 10: 48 89 ee movq %rbp, %rsi
// 13: 48 83 e6 fe andq $-0x2, %rsi
// 17: 4d 89 6c 24 40 movq %r13, 0x40(%r12)
// 1c: ff 15 00 00 00 00 callq *(%rip) # 0x22 <_JIT_ENTRY+0x22>
// 000000000000001e: R_X86_64_GOTPCRELX _PyLong_Add-0x4
// 22: 48 89 c3 movq %rax, %rbx
// 25: 4d 8b 6c 24 40 movq 0x40(%r12), %r13
// 2a: 40 f6 c5 01 testb $0x1, %bpl
// 2e: 75 05 jne 0x35 <_JIT_ENTRY+0x35>
// 30: ff 4d 00 decl (%rbp)
// 33: 74 3f je 0x74 <_JIT_ENTRY+0x74>
// 35: 41 f6 c7 01 testb $0x1, %r15b
// 39: 75 71 jne 0xac <_JIT_ENTRY+0xac>
// 3b: 41 ff 0f decl (%r15)
// 3e: 75 6c jne 0xac <_JIT_ENTRY+0xac>
// 40: 48 b8 00 00 00 00 00 00 00 00 movabsq $0x0, %rax
// 0000000000000042: R_X86_64_64 _PyRuntime+0x28e0
// 4a: 48 8b 00 movq (%rax), %rax
// 4d: 48 85 c0 testq %rax, %rax
// 50: 74 17 je 0x69 <_JIT_ENTRY+0x69>
// 52: 48 b9 00 00 00 00 00 00 00 00 movabsq $0x0, %rcx
// 0000000000000054: R_X86_64_64 _PyRuntime+0x28e8
// 5c: 48 8b 11 movq (%rcx), %rdx
// 5f: 4c 89 ff movq %r15, %rdi
// 62: be 01 00 00 00 movl $0x1, %esi
// 67: ff d0 callq *%rax
// 69: 4c 89 ff movq %r15, %rdi
// 6c: ff 15 00 00 00 00 callq *(%rip) # 0x72 <_JIT_ENTRY+0x72>
// 000000000000006e: R_X86_64_GOTPCRELX _PyLong_ExactDealloc-0x4
// 72: eb 38 jmp 0xac <_JIT_ENTRY+0xac>
// 74: 48 b8 00 00 00 00 00 00 00 00 movabsq $0x0, %rax
// 0000000000000076: R_X86_64_64 _PyRuntime+0x28e0
// 7e: 48 8b 00 movq (%rax), %rax
// 81: 48 85 c0 testq %rax, %rax
// 84: 74 17 je 0x9d <_JIT_ENTRY+0x9d>
// 86: 48 b9 00 00 00 00 00 00 00 00 movabsq $0x0, %rcx
// 0000000000000088: R_X86_64_64 _PyRuntime+0x28e8
// 90: 48 8b 11 movq (%rcx), %rdx
// 93: 48 89 ef movq %rbp, %rdi
// 96: be 01 00 00 00 movl $0x1, %esi
// 9b: ff d0 callq *%rax
// 9d: 48 89 ef movq %rbp, %rdi
// a0: ff 15 00 00 00 00 callq *(%rip) # 0xa6 <_JIT_ENTRY+0xa6>
// 00000000000000a2: R_X86_64_GOTPCRELX _PyLong_ExactDealloc-0x4
// a6: 41 f6 c7 01 testb $0x1, %r15b
// aa: 74 8f je 0x3b <_JIT_ENTRY+0x3b>
// ac: 48 85 db testq %rbx, %rbx
// af: 74 18 je 0xc9 <_JIT_ENTRY+0xc9>
// b1: 0f b7 43 06 movzwl 0x6(%rbx), %eax
// b5: 83 e0 01 andl $0x1, %eax
// b8: 48 09 d8 orq %rbx, %rax
// bb: 49 89 45 f0 movq %rax, -0x10(%r13)
// bf: 49 83 c5 f8 addq $-0x8, %r13
// c3: 5d popq %rbp
// c4: e9 00 00 00 00 jmp 0xc9 <_JIT_ENTRY+0xc9>
// 00000000000000c5: R_X86_64_PLT32 _JIT_CONTINUE-0x4
// c9: 49 83 c5 f0 addq $-0x10, %r13
// cd: 5d popq %rbp
// ce: e9 00 00 00 00 jmp 0xd3 <_JIT_ENTRY+0xd3>
// 00000000000000cf: R_X86_64_PLT32 _JIT_ERROR_TARGET-0x4
After:
//
// _BINARY_OP_ADD_INT__NO_INPUT_DECREF.o: file format elf64-x86-64
//
// Disassembly of section .text:
//
// 0000000000000000 <_JIT_ENTRY>:
// 0: 50 pushq %rax
// 1: 49 8b 7d f0 movq -0x10(%r13), %rdi
// 5: 49 8b 75 f8 movq -0x8(%r13), %rsi
// 9: 48 83 e7 fe andq $-0x2, %rdi
// d: 48 83 e6 fe andq $-0x2, %rsi
// 11: 4d 89 6c 24 40 movq %r13, 0x40(%r12)
// 16: ff 15 00 00 00 00 callq *(%rip) # 0x1c <_JIT_ENTRY+0x1c>
// 0000000000000018: R_X86_64_GOTPCRELX _PyLong_Add-0x4
// 1c: 4d 8b 6c 24 40 movq 0x40(%r12), %r13
// 21: 48 85 c0 testq %rax, %rax
// 24: 74 18 je 0x3e <_JIT_ENTRY+0x3e>
// 26: 0f b7 48 06 movzwl 0x6(%rax), %ecx
// 2a: 83 e1 01 andl $0x1, %ecx
// 2d: 48 09 c1 orq %rax, %rcx
// 30: 49 89 4d f0 movq %rcx, -0x10(%r13)
// 34: 49 83 c5 f8 addq $-0x8, %r13
// 38: 58 popq %rax
// 39: e9 00 00 00 00 jmp 0x3e <_JIT_ENTRY+0x3e>
// 000000000000003a: R_X86_64_PLT32 _JIT_CONTINUE-0x4
// 3e: 49 83 c5 f0 addq $-0x10, %r13
// 42: 58 popq %rax
// 43: e9 00 00 00 00 jmp 0x48 <_JIT_ENTRY+0x48>
// 0000000000000044: R_X86_64_PLT32 _JIT_ERROR_TARGET-0x4
FYI here https://github.com/python/cpython/issues/131798#issuecomment-2997540181
100% want to take a part into this issue. BTW I may need a little bit time to read the whole context and the previous PR made by others.
So maybe I will send the first PR in this weekend. If this is acceptable, 100% want to join into this
@Zheaoli and anyone else reading this thread who wants to contribute. No worries, I'm happy to explain and answer any questions over email or video if you want to. My email is on my GH profile.
No worries, I'm happy to explain and answer any questions over email or video if you want to.
Just for joking: This is 100% facts(lol. @Fidget-Spinner is a supper nice guy(lol
I am also working on _CALL_ISINSTANCE
I have submit the PR for _CALL_STR_1 , I will try _CALL_LIST_APPEND if you guys don't mind it
I have submit the PR for
_CALL_STR_1, I will try_CALL_LIST_APPENDif you guys don't mind it
Thanks. Btw, we're waiting for https://github.com/python/cpython/pull/135465/files to get merged first before we can continue. So you can take a rest first :).
This issue is blocked till we get TOS caching in. So please don't work on it anymore until that is merged. Thank you!
May I ask what is status about TOS caching?(lol
We're still waiting on Mark
We're still waiting on Mark
Got it, I will try to handle another part