Feature or enhancement

Proposal:

Thanks to Matt's work on borrowed LOAD_FAST, we can now eliminate reference counting trivially in the JIT.

Reference counting is expensive, Matt found that eliminating 90% of refcounts in LOAD_FAST meant a 2-3% speedup in the interpreter. So the speedup for the JIT should be quite a bit too.

The other problem is that reference counts block register allocation/TOS caching. As they force spills to the stack more often.

This issue has the potential to speedup up JIT benchmarks by several percent.

How to contribute.

Now that reference tracking in the JIT optimizer is in place, the first thing we need to do is to convert ops to make the decref an explicit uop.

For escaping decref ops (ie, ops that decref could run the GC), we would need to refactor them so their decrefs are eliminated via specialization of pops. For example: the following op (which is not an escaping op, but just purely for demonstration!):

macro(BINARY_OP_ADD_INT) =
_GUARD_TOS_INT + _GUARD_NOS_INT + unused/5 + _BINARY_OP_ADD_INT;

becomes

macro(BINARY_OP_ADD_INT) =
_GUARD_TOS_INT + _GUARD_NOS_INT + unused/5 + _BINARY_OP_ADD_INT + _POP_TOP_INT + _POP_TOP_INT;

Previously _BINARY_OP_ADD_INT's stack effect looked like this: (left, right -- res). The new version should look like (left, right -- res, left, right). So for all the decref specializations, we would just need a _POP_X of their variants! This means no explosion of uop and their decref variants. We just specialize _POP_X to _POP_X_NO_DECREF in the JIT. Keeping things clean

These are open for contributors to take:

[X] POP_TOP @Fidget-Spinner
[X] STORE_FAST @Fidget-Spinner
[x] _BINARY_OP_X_FLOAT @Fidget-Spinner
[x] _BINARY_OP_X_INT @Fidget-Spinner
[x] _BINARY_OP_ADD_UNICODE @corona10
[x] _BINARY_OP_SUBSCR_LIST_INT @cocolato
[x] _BINARY_OP_SUBSCR_TUPLE_INT @cocolato
[x] _BINARY_OP_SUBSCR_STR_INT @savannahostrowski
[ ] _BINARY_OP_SUBSCR_DICT @caje731
[x] _COMPARE_OP_X @cocolato
[x] _CALL_TYPE_1 @tomasr8
[x] _CALL_STR_1 @Zheaoli
[x] _CALL_TUPLE_1 @noamcohen97
[x] _CALL_BUILTIN_O @AndPuQing
[x] _CALL_LEN @corona10
[ ] _CALL_ISINSTANCE @noamcohen97
[x] _CALL_LIST_APPEND
[x] _CALL_METHOD_DESCRIPTOR_O @corona10
[x] _STORE_SUBSCR_DICT @corona10
[x] _STORE_SUBSCR_LIST_INT @corona10
[x] _LOAD_ATTR_INSTANCE_VALUE @Zheaoli
[x] _LOAD_ATTR_WITH_HINT @cocolato
[x] _LOAD_ATTR_SLOT @Zheaoli
[x] _STORE_ATTR_INSTANCE_VALUE @Zheaoli
[x] _STORE_ATTR_SLOT @savannahostrowski
[x] _STORE_ATTR_WITH_HINT @Zheaoli
[x] IS_OP @cocolato
[ ] TO_BOOL_ALWAYS_TRUE @reidenong
[ ] TO_BOOL_INT @reidenong
[ ] TO_BOOL_LIST @reidenong
[ ] TO_BOOL_STR @Zheaoli

Once that's done, we can think about further ops.

Has this already been discussed elsewhere?

No response given

Links to previous discussion of this feature:

No response

Linked PRs

gh-134588
gh-135761
gh-135817
gh-135818
gh-135860
gh-135907
gh-136056
gh-136070
gh-136077
gh-136104
gh-142576
gh-142604
gh-142620
gh-142695
gh-142703
gh-142711
gh-142712
gh-142729
gh-142759
gh-142765
gh-142767
gh-142769
gh-142825
gh-142844
gh-142921
gh-142926
gh-142949
gh-142952
gh-143062
gh-143094
gh-143107
gh-143171
gh-143186
gh-143320
gh-143330
gh-143333
gh-143336
gh-143417
gh-143427

May 23 '25 13:05 Fidget-Spinner

@tomasr8 @brandtbucher this is gonna be a pretty sweet optimization :) Before:

    // 
    // _BINARY_OP_ADD_INT.o:  file format elf64-x86-64
    // 
    // Disassembly of section .text:
    // 
    // 0000000000000000 <_JIT_ENTRY>:
    // 0: 55                            pushq   %rbp
    // 1: 4d 8b 7d f0                   movq    -0x10(%r13), %r15
    // 5: 49 8b 6d f8                   movq    -0x8(%r13), %rbp
    // 9: 4c 89 ff                      movq    %r15, %rdi
    // c: 48 83 e7 fe                   andq    $-0x2, %rdi
    // 10: 48 89 ee                      movq    %rbp, %rsi
    // 13: 48 83 e6 fe                   andq    $-0x2, %rsi
    // 17: 4d 89 6c 24 40                movq    %r13, 0x40(%r12)
    // 1c: ff 15 00 00 00 00             callq   *(%rip)                 # 0x22 <_JIT_ENTRY+0x22>
    // 000000000000001e:  R_X86_64_GOTPCRELX   _PyLong_Add-0x4
    // 22: 48 89 c3                      movq    %rax, %rbx
    // 25: 4d 8b 6c 24 40                movq    0x40(%r12), %r13
    // 2a: 40 f6 c5 01                   testb   $0x1, %bpl
    // 2e: 75 05                         jne     0x35 <_JIT_ENTRY+0x35>
    // 30: ff 4d 00                      decl    (%rbp)
    // 33: 74 3f                         je      0x74 <_JIT_ENTRY+0x74>
    // 35: 41 f6 c7 01                   testb   $0x1, %r15b
    // 39: 75 71                         jne     0xac <_JIT_ENTRY+0xac>
    // 3b: 41 ff 0f                      decl    (%r15)
    // 3e: 75 6c                         jne     0xac <_JIT_ENTRY+0xac>
    // 40: 48 b8 00 00 00 00 00 00 00 00 movabsq $0x0, %rax
    // 0000000000000042:  R_X86_64_64  _PyRuntime+0x28e0
    // 4a: 48 8b 00                      movq    (%rax), %rax
    // 4d: 48 85 c0                      testq   %rax, %rax
    // 50: 74 17                         je      0x69 <_JIT_ENTRY+0x69>
    // 52: 48 b9 00 00 00 00 00 00 00 00 movabsq $0x0, %rcx
    // 0000000000000054:  R_X86_64_64  _PyRuntime+0x28e8
    // 5c: 48 8b 11                      movq    (%rcx), %rdx
    // 5f: 4c 89 ff                      movq    %r15, %rdi
    // 62: be 01 00 00 00                movl    $0x1, %esi
    // 67: ff d0                         callq   *%rax
    // 69: 4c 89 ff                      movq    %r15, %rdi
    // 6c: ff 15 00 00 00 00             callq   *(%rip)                 # 0x72 <_JIT_ENTRY+0x72>
    // 000000000000006e:  R_X86_64_GOTPCRELX   _PyLong_ExactDealloc-0x4
    // 72: eb 38                         jmp     0xac <_JIT_ENTRY+0xac>
    // 74: 48 b8 00 00 00 00 00 00 00 00 movabsq $0x0, %rax
    // 0000000000000076:  R_X86_64_64  _PyRuntime+0x28e0
    // 7e: 48 8b 00                      movq    (%rax), %rax
    // 81: 48 85 c0                      testq   %rax, %rax
    // 84: 74 17                         je      0x9d <_JIT_ENTRY+0x9d>
    // 86: 48 b9 00 00 00 00 00 00 00 00 movabsq $0x0, %rcx
    // 0000000000000088:  R_X86_64_64  _PyRuntime+0x28e8
    // 90: 48 8b 11                      movq    (%rcx), %rdx
    // 93: 48 89 ef                      movq    %rbp, %rdi
    // 96: be 01 00 00 00                movl    $0x1, %esi
    // 9b: ff d0                         callq   *%rax
    // 9d: 48 89 ef                      movq    %rbp, %rdi
    // a0: ff 15 00 00 00 00             callq   *(%rip)                 # 0xa6 <_JIT_ENTRY+0xa6>
    // 00000000000000a2:  R_X86_64_GOTPCRELX   _PyLong_ExactDealloc-0x4
    // a6: 41 f6 c7 01                   testb   $0x1, %r15b
    // aa: 74 8f                         je      0x3b <_JIT_ENTRY+0x3b>
    // ac: 48 85 db                      testq   %rbx, %rbx
    // af: 74 18                         je      0xc9 <_JIT_ENTRY+0xc9>
    // b1: 0f b7 43 06                   movzwl  0x6(%rbx), %eax
    // b5: 83 e0 01                      andl    $0x1, %eax
    // b8: 48 09 d8                      orq     %rbx, %rax
    // bb: 49 89 45 f0                   movq    %rax, -0x10(%r13)
    // bf: 49 83 c5 f8                   addq    $-0x8, %r13
    // c3: 5d                            popq    %rbp
    // c4: e9 00 00 00 00                jmp     0xc9 <_JIT_ENTRY+0xc9>
    // 00000000000000c5:  R_X86_64_PLT32       _JIT_CONTINUE-0x4
    // c9: 49 83 c5 f0                   addq    $-0x10, %r13
    // cd: 5d                            popq    %rbp
    // ce: e9 00 00 00 00                jmp     0xd3 <_JIT_ENTRY+0xd3>
    // 00000000000000cf:  R_X86_64_PLT32       _JIT_ERROR_TARGET-0x4

After:

    // 
    // _BINARY_OP_ADD_INT__NO_INPUT_DECREF.o: file format elf64-x86-64
    // 
    // Disassembly of section .text:
    // 
    // 0000000000000000 <_JIT_ENTRY>:
    // 0: 50                            pushq   %rax
    // 1: 49 8b 7d f0                   movq    -0x10(%r13), %rdi
    // 5: 49 8b 75 f8                   movq    -0x8(%r13), %rsi
    // 9: 48 83 e7 fe                   andq    $-0x2, %rdi
    // d: 48 83 e6 fe                   andq    $-0x2, %rsi
    // 11: 4d 89 6c 24 40                movq    %r13, 0x40(%r12)
    // 16: ff 15 00 00 00 00             callq   *(%rip)                 # 0x1c <_JIT_ENTRY+0x1c>
    // 0000000000000018:  R_X86_64_GOTPCRELX   _PyLong_Add-0x4
    // 1c: 4d 8b 6c 24 40                movq    0x40(%r12), %r13
    // 21: 48 85 c0                      testq   %rax, %rax
    // 24: 74 18                         je      0x3e <_JIT_ENTRY+0x3e>
    // 26: 0f b7 48 06                   movzwl  0x6(%rax), %ecx
    // 2a: 83 e1 01                      andl    $0x1, %ecx
    // 2d: 48 09 c1                      orq     %rax, %rcx
    // 30: 49 89 4d f0                   movq    %rcx, -0x10(%r13)
    // 34: 49 83 c5 f8                   addq    $-0x8, %r13
    // 38: 58                            popq    %rax
    // 39: e9 00 00 00 00                jmp     0x3e <_JIT_ENTRY+0x3e>
    // 000000000000003a:  R_X86_64_PLT32       _JIT_CONTINUE-0x4
    // 3e: 49 83 c5 f0                   addq    $-0x10, %r13
    // 42: 58                            popq    %rax
    // 43: e9 00 00 00 00                jmp     0x48 <_JIT_ENTRY+0x48>
    // 0000000000000044:  R_X86_64_PLT32       _JIT_ERROR_TARGET-0x4

May 23 '25 13:05 Fidget-Spinner

FYI here https://github.com/python/cpython/issues/131798#issuecomment-2997540181

100% want to take a part into this issue. BTW I may need a little bit time to read the whole context and the previous PR made by others.

So maybe I will send the first PR in this weekend. If this is acceptable, 100% want to join into this

Jun 23 '25 18:06 Zheaoli

@Zheaoli and anyone else reading this thread who wants to contribute. No worries, I'm happy to explain and answer any questions over email or video if you want to. My email is on my GH profile.

Jun 23 '25 18:06 Fidget-Spinner

No worries, I'm happy to explain and answer any questions over email or video if you want to.

Just for joking: This is 100% facts(lol. @Fidget-Spinner is a supper nice guy(lol

Jun 23 '25 18:06 Zheaoli

I am also working on _CALL_ISINSTANCE

Jun 24 '25 09:06 noamcohen97

I have submit the PR for _CALL_STR_1 , I will try _CALL_LIST_APPEND if you guys don't mind it

Jun 28 '25 10:06 Zheaoli

I have submit the PR for _CALL_STR_1 , I will try _CALL_LIST_APPEND if you guys don't mind it

Thanks. Btw, we're waiting for https://github.com/python/cpython/pull/135465/files to get merged first before we can continue. So you can take a rest first :).

Jun 28 '25 10:06 Fidget-Spinner

This issue is blocked till we get TOS caching in. So please don't work on it anymore until that is merged. Thank you!

Jul 06 '25 12:07 Fidget-Spinner

May I ask what is status about TOS caching?(lol

Sep 27 '25 08:09 Zheaoli

We're still waiting on Mark

Sep 27 '25 08:09 Fidget-Spinner

We're still waiting on Mark

Got it, I will try to handle another part

Sep 27 '25 13:09 Zheaoli

Eliminate redundant refcounting in the JIT

Feature or enhancement

Proposal:

How to contribute.

Has this already been discussed elsewhere?

Links to previous discussion of this feature:

Linked PRs