cpython icon indicating copy to clipboard operation
cpython copied to clipboard

Eliminate redundant refcounting in the JIT

Open Fidget-Spinner opened this issue 7 months ago • 35 comments

Feature or enhancement

Proposal:

Thanks to Matt's work on borrowed LOAD_FAST, we can now eliminate reference counting trivially in the JIT.

Reference counting is expensive, Matt found that eliminating 90% of refcounts in LOAD_FAST meant a 2-3% speedup in the interpreter. So the speedup for the JIT should be quite a bit too.

The other problem is that reference counts block register allocation/TOS caching. As they force spills to the stack more often.

This issue has the potential to speedup up JIT benchmarks by several percent.

How to contribute.

Now that reference tracking in the JIT optimizer is in place, the first thing we need to do is to convert ops to make the decref an explicit uop.

For escaping decref ops (ie, ops that decref could run the GC), we would need to refactor them so their decrefs are eliminated via specialization of pops. For example: the following op (which is not an escaping op, but just purely for demonstration!):

macro(BINARY_OP_ADD_INT) =
_GUARD_TOS_INT + _GUARD_NOS_INT + unused/5 + _BINARY_OP_ADD_INT;

becomes

macro(BINARY_OP_ADD_INT) =
_GUARD_TOS_INT + _GUARD_NOS_INT + unused/5 + _BINARY_OP_ADD_INT + _POP_TOP_INT + _POP_TOP_INT;

Previously _BINARY_OP_ADD_INT's stack effect looked like this: (left, right -- res). The new version should look like (left, right -- res, left, right). So for all the decref specializations, we would just need a _POP_X of their variants! This means no explosion of uop and their decref variants. We just specialize _POP_X to _POP_X_NO_DECREF in the JIT. Keeping things clean

These are open for contributors to take:

  • [X] POP_TOP @Fidget-Spinner
  • [X] STORE_FAST @Fidget-Spinner
  • [x] _BINARY_OP_X_FLOAT @Fidget-Spinner
  • [x] _BINARY_OP_X_INT @Fidget-Spinner
  • [x] _BINARY_OP_ADD_UNICODE @corona10
  • [x] _BINARY_OP_SUBSCR_LIST_INT @cocolato
  • [x] _BINARY_OP_SUBSCR_TUPLE_INT @cocolato
  • [x] _BINARY_OP_SUBSCR_STR_INT @savannahostrowski
  • [ ] _BINARY_OP_SUBSCR_DICT @caje731
  • [x] _COMPARE_OP_X @cocolato
  • [x] _CALL_TYPE_1 @tomasr8
  • [x] _CALL_STR_1 @Zheaoli
  • [x] _CALL_TUPLE_1 @noamcohen97
  • [x] _CALL_BUILTIN_O @AndPuQing
  • [x] _CALL_LEN @corona10
  • [ ] _CALL_ISINSTANCE @noamcohen97
  • [x] _CALL_LIST_APPEND
  • [x] _CALL_METHOD_DESCRIPTOR_O @corona10
  • [x] _STORE_SUBSCR_DICT @corona10
  • [x] _STORE_SUBSCR_LIST_INT @corona10
  • [x] _LOAD_ATTR_INSTANCE_VALUE @Zheaoli
  • [x] _LOAD_ATTR_WITH_HINT @cocolato
  • [x] _LOAD_ATTR_SLOT @Zheaoli
  • [x] _STORE_ATTR_INSTANCE_VALUE @Zheaoli
  • [x] _STORE_ATTR_SLOT @savannahostrowski
  • [x] _STORE_ATTR_WITH_HINT @Zheaoli
  • [x] IS_OP @cocolato
  • [ ] TO_BOOL_ALWAYS_TRUE @reidenong
  • [ ] TO_BOOL_INT @reidenong
  • [ ] TO_BOOL_LIST @reidenong
  • [ ] TO_BOOL_STR @Zheaoli

Once that's done, we can think about further ops.

Has this already been discussed elsewhere?

No response given

Links to previous discussion of this feature:

No response

Linked PRs

  • gh-134588
  • gh-135761
  • gh-135817
  • gh-135818
  • gh-135860
  • gh-135907
  • gh-136056
  • gh-136070
  • gh-136077
  • gh-136104
  • gh-142576
  • gh-142604
  • gh-142620
  • gh-142695
  • gh-142703
  • gh-142711
  • gh-142712
  • gh-142729
  • gh-142759
  • gh-142765
  • gh-142767
  • gh-142769
  • gh-142825
  • gh-142844
  • gh-142921
  • gh-142926
  • gh-142949
  • gh-142952
  • gh-143062
  • gh-143094
  • gh-143107
  • gh-143171
  • gh-143186
  • gh-143320
  • gh-143330
  • gh-143333
  • gh-143336
  • gh-143417
  • gh-143427

Fidget-Spinner avatar May 23 '25 13:05 Fidget-Spinner

@tomasr8 @brandtbucher this is gonna be a pretty sweet optimization :) Before:

    // 
    // _BINARY_OP_ADD_INT.o:  file format elf64-x86-64
    // 
    // Disassembly of section .text:
    // 
    // 0000000000000000 <_JIT_ENTRY>:
    // 0: 55                            pushq   %rbp
    // 1: 4d 8b 7d f0                   movq    -0x10(%r13), %r15
    // 5: 49 8b 6d f8                   movq    -0x8(%r13), %rbp
    // 9: 4c 89 ff                      movq    %r15, %rdi
    // c: 48 83 e7 fe                   andq    $-0x2, %rdi
    // 10: 48 89 ee                      movq    %rbp, %rsi
    // 13: 48 83 e6 fe                   andq    $-0x2, %rsi
    // 17: 4d 89 6c 24 40                movq    %r13, 0x40(%r12)
    // 1c: ff 15 00 00 00 00             callq   *(%rip)                 # 0x22 <_JIT_ENTRY+0x22>
    // 000000000000001e:  R_X86_64_GOTPCRELX   _PyLong_Add-0x4
    // 22: 48 89 c3                      movq    %rax, %rbx
    // 25: 4d 8b 6c 24 40                movq    0x40(%r12), %r13
    // 2a: 40 f6 c5 01                   testb   $0x1, %bpl
    // 2e: 75 05                         jne     0x35 <_JIT_ENTRY+0x35>
    // 30: ff 4d 00                      decl    (%rbp)
    // 33: 74 3f                         je      0x74 <_JIT_ENTRY+0x74>
    // 35: 41 f6 c7 01                   testb   $0x1, %r15b
    // 39: 75 71                         jne     0xac <_JIT_ENTRY+0xac>
    // 3b: 41 ff 0f                      decl    (%r15)
    // 3e: 75 6c                         jne     0xac <_JIT_ENTRY+0xac>
    // 40: 48 b8 00 00 00 00 00 00 00 00 movabsq $0x0, %rax
    // 0000000000000042:  R_X86_64_64  _PyRuntime+0x28e0
    // 4a: 48 8b 00                      movq    (%rax), %rax
    // 4d: 48 85 c0                      testq   %rax, %rax
    // 50: 74 17                         je      0x69 <_JIT_ENTRY+0x69>
    // 52: 48 b9 00 00 00 00 00 00 00 00 movabsq $0x0, %rcx
    // 0000000000000054:  R_X86_64_64  _PyRuntime+0x28e8
    // 5c: 48 8b 11                      movq    (%rcx), %rdx
    // 5f: 4c 89 ff                      movq    %r15, %rdi
    // 62: be 01 00 00 00                movl    $0x1, %esi
    // 67: ff d0                         callq   *%rax
    // 69: 4c 89 ff                      movq    %r15, %rdi
    // 6c: ff 15 00 00 00 00             callq   *(%rip)                 # 0x72 <_JIT_ENTRY+0x72>
    // 000000000000006e:  R_X86_64_GOTPCRELX   _PyLong_ExactDealloc-0x4
    // 72: eb 38                         jmp     0xac <_JIT_ENTRY+0xac>
    // 74: 48 b8 00 00 00 00 00 00 00 00 movabsq $0x0, %rax
    // 0000000000000076:  R_X86_64_64  _PyRuntime+0x28e0
    // 7e: 48 8b 00                      movq    (%rax), %rax
    // 81: 48 85 c0                      testq   %rax, %rax
    // 84: 74 17                         je      0x9d <_JIT_ENTRY+0x9d>
    // 86: 48 b9 00 00 00 00 00 00 00 00 movabsq $0x0, %rcx
    // 0000000000000088:  R_X86_64_64  _PyRuntime+0x28e8
    // 90: 48 8b 11                      movq    (%rcx), %rdx
    // 93: 48 89 ef                      movq    %rbp, %rdi
    // 96: be 01 00 00 00                movl    $0x1, %esi
    // 9b: ff d0                         callq   *%rax
    // 9d: 48 89 ef                      movq    %rbp, %rdi
    // a0: ff 15 00 00 00 00             callq   *(%rip)                 # 0xa6 <_JIT_ENTRY+0xa6>
    // 00000000000000a2:  R_X86_64_GOTPCRELX   _PyLong_ExactDealloc-0x4
    // a6: 41 f6 c7 01                   testb   $0x1, %r15b
    // aa: 74 8f                         je      0x3b <_JIT_ENTRY+0x3b>
    // ac: 48 85 db                      testq   %rbx, %rbx
    // af: 74 18                         je      0xc9 <_JIT_ENTRY+0xc9>
    // b1: 0f b7 43 06                   movzwl  0x6(%rbx), %eax
    // b5: 83 e0 01                      andl    $0x1, %eax
    // b8: 48 09 d8                      orq     %rbx, %rax
    // bb: 49 89 45 f0                   movq    %rax, -0x10(%r13)
    // bf: 49 83 c5 f8                   addq    $-0x8, %r13
    // c3: 5d                            popq    %rbp
    // c4: e9 00 00 00 00                jmp     0xc9 <_JIT_ENTRY+0xc9>
    // 00000000000000c5:  R_X86_64_PLT32       _JIT_CONTINUE-0x4
    // c9: 49 83 c5 f0                   addq    $-0x10, %r13
    // cd: 5d                            popq    %rbp
    // ce: e9 00 00 00 00                jmp     0xd3 <_JIT_ENTRY+0xd3>
    // 00000000000000cf:  R_X86_64_PLT32       _JIT_ERROR_TARGET-0x4

After:

    // 
    // _BINARY_OP_ADD_INT__NO_INPUT_DECREF.o: file format elf64-x86-64
    // 
    // Disassembly of section .text:
    // 
    // 0000000000000000 <_JIT_ENTRY>:
    // 0: 50                            pushq   %rax
    // 1: 49 8b 7d f0                   movq    -0x10(%r13), %rdi
    // 5: 49 8b 75 f8                   movq    -0x8(%r13), %rsi
    // 9: 48 83 e7 fe                   andq    $-0x2, %rdi
    // d: 48 83 e6 fe                   andq    $-0x2, %rsi
    // 11: 4d 89 6c 24 40                movq    %r13, 0x40(%r12)
    // 16: ff 15 00 00 00 00             callq   *(%rip)                 # 0x1c <_JIT_ENTRY+0x1c>
    // 0000000000000018:  R_X86_64_GOTPCRELX   _PyLong_Add-0x4
    // 1c: 4d 8b 6c 24 40                movq    0x40(%r12), %r13
    // 21: 48 85 c0                      testq   %rax, %rax
    // 24: 74 18                         je      0x3e <_JIT_ENTRY+0x3e>
    // 26: 0f b7 48 06                   movzwl  0x6(%rax), %ecx
    // 2a: 83 e1 01                      andl    $0x1, %ecx
    // 2d: 48 09 c1                      orq     %rax, %rcx
    // 30: 49 89 4d f0                   movq    %rcx, -0x10(%r13)
    // 34: 49 83 c5 f8                   addq    $-0x8, %r13
    // 38: 58                            popq    %rax
    // 39: e9 00 00 00 00                jmp     0x3e <_JIT_ENTRY+0x3e>
    // 000000000000003a:  R_X86_64_PLT32       _JIT_CONTINUE-0x4
    // 3e: 49 83 c5 f0                   addq    $-0x10, %r13
    // 42: 58                            popq    %rax
    // 43: e9 00 00 00 00                jmp     0x48 <_JIT_ENTRY+0x48>
    // 0000000000000044:  R_X86_64_PLT32       _JIT_ERROR_TARGET-0x4

Fidget-Spinner avatar May 23 '25 13:05 Fidget-Spinner

FYI here https://github.com/python/cpython/issues/131798#issuecomment-2997540181

100% want to take a part into this issue. BTW I may need a little bit time to read the whole context and the previous PR made by others.

So maybe I will send the first PR in this weekend. If this is acceptable, 100% want to join into this

Zheaoli avatar Jun 23 '25 18:06 Zheaoli

@Zheaoli and anyone else reading this thread who wants to contribute. No worries, I'm happy to explain and answer any questions over email or video if you want to. My email is on my GH profile.

Fidget-Spinner avatar Jun 23 '25 18:06 Fidget-Spinner

No worries, I'm happy to explain and answer any questions over email or video if you want to.

Just for joking: This is 100% facts(lol. @Fidget-Spinner is a supper nice guy(lol

Zheaoli avatar Jun 23 '25 18:06 Zheaoli

I am also working on _CALL_ISINSTANCE

noamcohen97 avatar Jun 24 '25 09:06 noamcohen97

I have submit the PR for _CALL_STR_1 , I will try _CALL_LIST_APPEND if you guys don't mind it

Zheaoli avatar Jun 28 '25 10:06 Zheaoli

I have submit the PR for _CALL_STR_1 , I will try _CALL_LIST_APPEND if you guys don't mind it

Thanks. Btw, we're waiting for https://github.com/python/cpython/pull/135465/files to get merged first before we can continue. So you can take a rest first :).

Fidget-Spinner avatar Jun 28 '25 10:06 Fidget-Spinner

This issue is blocked till we get TOS caching in. So please don't work on it anymore until that is merged. Thank you!

Fidget-Spinner avatar Jul 06 '25 12:07 Fidget-Spinner

May I ask what is status about TOS caching?(lol

Zheaoli avatar Sep 27 '25 08:09 Zheaoli

We're still waiting on Mark

Fidget-Spinner avatar Sep 27 '25 08:09 Fidget-Spinner

We're still waiting on Mark

Got it, I will try to handle another part

Zheaoli avatar Sep 27 '25 13:09 Zheaoli