Better uop coverage in the JIT optimizer
Out of 263 total uops, 155 are ignored by the tier two optimizer. These represent over half of all uops by dynamic execution count.
This issue will serve as a checklist for auditing these missing uops and adding them where they make sense. At first glance, there's quite a bit of potential here, especially around the ability to narrow known output types (like _CONTAINS_OP_SET) and the ability to narrow input types and remove guards on them (like _BINARY_OP_SUBSCR_LIST_INT). As I go through, I'll ~cross out~ anything that doesn't seem like it makes sense to add.
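To make these two kinds of wins concrete, here is a minimal sketch of what such entries might look like in the optimizer's DSL (optimizer_bytecodes.c). The stack effects and helper names below (sym_new_type, sym_matches_type, sym_set_type, REPLACE_OP) are modeled on existing cases and are approximations rather than final code:

```c
// Output-type narrowing: _CONTAINS_OP_SET always pushes a bool, so record
// that fact; later bool checks and _TO_BOOL uops can then be removed.
op(_CONTAINS_OP_SET, (left, right -- res)) {
    res = sym_new_type(ctx, &PyBool_Type);
}

// Guard elimination: if the value is already known to be an int, the guard
// is redundant and can be turned into a _NOP; either way, record the type
// so that downstream uops (and guards) benefit.
op(_GUARD_NOS_INT, (left, unused -- left, unused)) {
    if (sym_matches_type(left, &PyLong_Type)) {
        REPLACE_OP(this_instr, _NOP, 0, 0);
    }
    sym_set_type(left, &PyLong_Type);
}
```

The first kind only records knowledge about the output; the second pays off immediately by deleting the guard whenever the input type has already been proven.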
First, here are the 53 missing uops that each represent at least 0.1% of all uops executed:
- [ ] ~_SET_IP~ (12.1%)
- [ ] ~_CHECK_VALIDITY~ (10.1%)
- [ ] ~_CHECK_VALIDITY_AND_SET_IP~ (6.5%)
- [ ] ~_CHECK_PERIODIC~ (3.1%)
- [ ] ~_MAKE_WARM~ (2.8%)
- [ ] ~_START_EXECUTOR~ (1.7%)
- [x] _GUARD_NOS_INT (1.5%)
- [ ] ~_BINARY_OP_SUBSCR_LIST_INT~ (1.0%)
- [ ] _CHECK_FUNCTION (1.0%)
- [ ] ~_CHECK_MANAGED_OBJECT_HAS_VALUES~ (0.7%)
- [ ] _ITER_CHECK_LIST (0.7%)
- [x] _CONTAINS_OP_SET (0.6%)
- [ ] _FOR_ITER_TIER_TWO (0.6%)
- [ ] ~_GUARD_NOT_EXHAUSTED_LIST~ (0.6%)
- [ ] ~_ITER_NEXT_LIST_TIER_TWO~ (0.6%)
- [ ] ~_SAVE_RETURN_OFFSET~ (0.6%)
- [x] _CALL_LEN (0.5%)
- [x] _CALL_LIST_APPEND (0.5%)
- [ ] _POP_TOP (0.5%)
- [ ] _RESUME_CHECK (0.5%)
- [x] _BINARY_OP_SUBSCR_STR_INT (0.4%)
- [ ] ~_GUARD_DORV_VALUES_INST_ATTR_FROM_DICT~ (0.4%)
- [ ] ~_GUARD_KEYS_VERSION~ (0.4%)
- [ ] ~_BINARY_OP_SUBSCR_DICT~ (0.3%)
- [ ] _CALL_BUILTIN_FAST (0.3%)
- [ ] _CHECK_STACK_SPACE_OPERAND (0.3%)
- [ ] _GET_ITER (0.3%)
- [ ] ~_STORE_SUBSCR~ (0.3%)
- [ ] ~_GUARD_NOT_EXHAUSTED_RANGE~ (0.2%)
- [x] _BINARY_SLICE (0.2%)
- [x] _BUILD_LIST (0.2%)
- [ ] _CALL_BUILTIN_O (0.2%)
- [ ] _CALL_NON_PY_GENERAL (0.2%)
- [ ] _CHECK_IS_NOT_PY_CALLABLE (0.2%)
- [x] _GUARD_NOS_FLOAT (0.2%)
- [ ] _ITER_CHECK_RANGE (0.2%)
- [ ] _ITER_CHECK_TUPLE (0.2%)
- [ ] _LOAD_DEREF (0.2%)
- [ ] ~_STORE_SUBSCR_LIST_INT~ (0.2%)
- [ ] _BINARY_OP_EXTEND (0.1%)
- [x] _CALL_ISINSTANCE (0.1%)
- [ ] _CALL_METHOD_DESCRIPTOR_FAST (0.1%)
- [ ] _CALL_METHOD_DESCRIPTOR_FAST_WITH_KEYWORDS (0.1%)
- [ ] _CALL_METHOD_DESCRIPTOR_NOARGS (0.1%)
- [x] _CALL_TYPE_1 (0.1%)
- [x] _CHECK_ATTR_CLASS (0.1%)
- [x] _CONTAINS_OP_DICT (0.1%)
- [x] _GUARD_BINARY_OP_EXTEND (0.1%)
- [ ] ~_GUARD_NOT_EXHAUSTED_TUPLE~ (0.1%)
- [ ] ~_ITER_NEXT_TUPLE~ (0.1%)
- [ ] ~_LIST_APPEND~ (0.1%)
- [ ] ~_STORE_ATTR_SLOT~ (0.1%)
- [ ] ~_STORE_SUBSCR_DICT~ (0.1%)
And here are the 102 missing uops that each account for less than 0.1%. These are less important, but may still net us some wins on individual benchmarks:
- [ ] _BINARY_OP_SUBSCR_CHECK_FUNC
- [x] _BINARY_OP_SUBSCR_TUPLE_INT
- [x] _BUILD_MAP
- [x] _BUILD_SET
- [x] _BUILD_SLICE
- [x] _BUILD_STRING
- [ ] _CALL_BUILTIN_CLASS
- [ ] _CALL_BUILTIN_FAST_WITH_KEYWORDS
- [ ] _CALL_INTRINSIC_1
- [ ] _CALL_INTRINSIC_2
- [ ] _CALL_KW_NON_PY
- [ ] _CALL_METHOD_DESCRIPTOR_O
- [x] _CALL_STR_1
- [x] _CALL_TUPLE_1
- [ ] _CHECK_ATTR_METHOD_LAZY_DICT
- [ ] _CHECK_EG_MATCH
- [ ] _CHECK_EXC_MATCH
- [ ] _CHECK_FUNCTION_VERSION_INLINE
- [ ] _CHECK_FUNCTION_VERSION_KW
- [ ] _CHECK_IS_NOT_PY_CALLABLE_KW
- [ ] _CHECK_METHOD_VERSION
- [ ] _CHECK_METHOD_VERSION_KW
- [ ] _CHECK_PERIODIC_IF_NOT_YIELD_FROM
- [ ] _CONVERT_VALUE
- [ ] _COPY_FREE_VARS
- [ ] _DELETE_ATTR
- [ ] _DELETE_DEREF
- [ ] _DELETE_FAST
- [ ] _DELETE_GLOBAL
- [ ] _DELETE_NAME
- [ ] _DELETE_SUBSCR
- [ ] _DEOPT
- [ ] _DICT_MERGE
- [ ] _DICT_UPDATE
- [ ] _END_FOR
- [ ] _END_SEND
- [ ] _ERROR_POP_N
- [ ] _EXIT_INIT_CHECK
- [ ] _EXPAND_METHOD
- [ ] _EXPAND_METHOD_KW
- [ ] _FATAL_ERROR
- [ ] _FORMAT_SIMPLE
- [ ] _FORMAT_WITH_SPEC
- [ ] _GET_AITER
- [ ] _GET_ANEXT
- [ ] _GET_AWAITABLE
- [x] _GET_LEN
- [ ] _GET_YIELD_FROM_ITER
- [ ] _GUARD_DORV_NO_DICT
- [ ] _GUARD_GLOBALS_VERSION
- [x] _GUARD_TOS_FLOAT
- [x] _GUARD_TOS_INT
- [ ] _GUARD_TYPE_VERSION_AND_LOCK
- [ ] _IMPORT_FROM
- [ ] _IMPORT_NAME
- [ ] _IS_NONE
- [ ] _LIST_EXTEND
- [x] _LOAD_ATTR_NONDESCRIPTOR_NO_DICT
- [x] _LOAD_ATTR_NONDESCRIPTOR_WITH_VALUES
- [ ] _LOAD_BUILD_CLASS
- [ ] _LOAD_COMMON_CONSTANT
- [ ] _LOAD_FAST_LOAD_FAST
- [ ] _LOAD_FROM_DICT_OR_DEREF
- [ ] _LOAD_GLOBAL
- [ ] _LOAD_GLOBAL_BUILTINS
- [ ] _LOAD_GLOBAL_MODULE
- [ ] _LOAD_LOCALS
- [ ] _LOAD_NAME
- [ ] _LOAD_SUPER_ATTR_ATTR
- [ ] _LOAD_SUPER_ATTR_METHOD
- [ ] _MAKE_CALLARGS_A_TUPLE
- [ ] _MAKE_CELL
- [ ] _MAKE_FUNCTION
- [ ] _MAP_ADD
- [ ] _MATCH_CLASS
- [ ] _MATCH_KEYS
- [ ] _MATCH_MAPPING
- [ ] _MATCH_SEQUENCE
- [ ] _MAYBE_EXPAND_METHOD_KW
- [ ] ~_NOP~
- [ ] _POP_EXCEPT
- [ ] _POP_TWO_LOAD_CONST_INLINE_BORROW
- [ ] _PUSH_EXC_INFO
- [ ] _PUSH_NULL_CONDITIONAL
- [ ] _SETUP_ANNOTATIONS
- [ ] _SET_ADD
- [ ] _SET_FUNCTION_ATTRIBUTE
- [ ] _SET_UPDATE
- [ ] _STORE_ATTR
- [ ] _STORE_ATTR_INSTANCE_VALUE
- [ ] _STORE_ATTR_WITH_HINT
- [ ] _STORE_DEREF
- [ ] _STORE_FAST_LOAD_FAST
- [ ] _STORE_FAST_STORE_FAST
- [ ] _STORE_GLOBAL
- [ ] _STORE_NAME
- [ ] ~_STORE_SLICE~
- [ ] _TIER2_RESUME_CHECK
- [x] _UNARY_INVERT
- [ ] _UNARY_NEGATIVE
- [ ] ~_UNPACK_SEQUENCE_LIST~
- [ ] _WITH_EXCEPT_START
Linked PRs
- gh-131800
- gh-131816
- gh-132057
- gh-132153
- gh-132269
- gh-132278
- gh-132289
- gh-132419
- gh-132434
- gh-132564
- gh-132849
- gh-132851
- gh-132940
- gh-133003
- gh-133172
- gh-133180
- gh-133339
- gh-133345
- gh-133527
- gh-133609
- gh-134240
- gh-134268
- gh-134369
- gh-134373
- gh-134403
- gh-134406
- gh-134543
- gh-134554
- gh-134671
- gh-134803
- gh-135022
- gh-135194
- gh-135222
- gh-135223
- gh-135260
- gh-137607
- gh-143340
@diegorusso is going to add _CALL_LEN.
@fluhus is going to add _BINARY_SLICE.
@Klaus117 is going to improve _TO_BOOL_INT.
@Zheaoli is going to add _CONTAINS_OP_DICT.
I think I can work on _BINARY_OP_SUBSCR_LIST_INT and _BINARY_OP_SUBSCR_DICT
> I think I can work on _BINARY_OP_SUBSCR_LIST_INT and _BINARY_OP_SUBSCR_DICT
Sorry, I already have a branch to do the guards for these (and a couple of others) that I was going to put up in a minute! I'll tag you for review, though.
@tomasr8 is going to add _CALL_STR_1, _CALL_TUPLE_1, and _CALL_TYPE_1.
@Zheaoli, want to take _BUILD_LIST, _BUILD_MAP, _BUILD_SET, _BUILD_SLICE, and _BUILD_STRING? For all but _BUILD_SLICE and _BUILD_STRING, I think we can only set the type of the output (sym_new_type). For _BUILD_SLICE and _BUILD_STRING, we may be able to have a constant output if the items are constant (sym_is_const/sym_get_const/sym_new_const). Maybe one PR for the first three, and separate PRs for _BUILD_SLICE and _BUILD_STRING?
> want to take _BUILD_LIST, _BUILD_MAP, _BUILD_SET, _BUILD_SLICE, and _BUILD_STRING? For all but _BUILD_SLICE and _BUILD_STRING, I think we can only set the type of the output (sym_new_type). For _BUILD_SLICE and _BUILD_STRING, we may be able to have a constant output if the items are constant
100% want to (lol)
I'm on the way
I'd like to join. I'm a Python beginner; I've only ever written test scripts in Python at work, but I've seen its charms. I'd like to start learning from this part, if I may.
How can I join this?
Not sure I can do it well, but I would like to try my best :)
Sorry, I think at this point we already have enough people working on this issue (it's also a pretty tough one if you've never contributed to CPython before). Thanks for the interest in helping out, though!
> @Zheaoli, want to take _BUILD_LIST, _BUILD_MAP, _BUILD_SET, _BUILD_SLICE, and _BUILD_STRING? For all but _BUILD_SLICE and _BUILD_STRING, I think we can only set the type of the output (sym_new_type). For _BUILD_SLICE and _BUILD_STRING, we may be able to have a constant output if the items are constant (sym_is_const/sym_get_const/sym_new_const). Maybe one PR for the first three, and separate PRs for _BUILD_SLICE and _BUILD_STRING?
Sorry for the delay; I might have some issues here. I've noticed that we already handle a similar uop: https://github.com/python/cpython/blob/main/Python/optimizer_bytecodes.c#L918-L920. It seems it was created in https://github.com/python/cpython/pull/128940/files
I'm not sure whether we need to create a sym_new_list like sym_new_tuple, or whether we just need sym_new_type(ctx, &PyList_Type);
> I'm not sure whether we need to create a sym_new_list like sym_new_tuple, or whether we just need sym_new_type(ctx, &PyList_Type);
For these, we just want to set the type. The symbolic tuple representation is a special case, since tuples are immutable. Tracking the values in a list is much harder, since the list can change in many different ways. For example:
    t = (a, b)
    l = [a, b]
    foo(t, l)
    x = t[0]  # The JIT can prove that x is a.
    y = l[0]  # l may have been mutated by foo, so we can't prove anything.
The symbolic tuple representation allows us to track symbolic values inside of a tuple, even if they aren't constant. For instance, if we later guard that x is an int, then the JIT has also proven that a is an int as well (and vice-versa!). We may choose to (very carefully!) track the contents of mutable containers or object attributes in the future, but in general this is much harder to do correctly, and pays off in fewer cases.
> For these, we just want to set the type. The symbolic tuple representation is a special case, since tuples are immutable. Tracking the values in a list is much harder, since the list can change in many different ways. For example:
I see. For now, we just need to replace sym_new_not_null with sym_new_type; that might get us some performance improvement.
--- The following discussion is just my personal thoughts ---
> The symbolic tuple representation allows us to track symbolic values inside of a tuple, even if they aren't constant. For instance, if we later guard that x is an int, then the JIT has also proven that a is an int as well (and vice-versa!). We may choose to (very carefully!) track the contents of mutable containers or object attributes in the future, but in general this is much harder to do correctly, and pays off in fewer cases.
I got your point. So Mark set up sym_new_tuple here because tuples are immutable; in other words, the data built by _BUILD_TUPLE is predictable, so we can set up an extra optimization path for tuples. But for mutable types like set, map, and list, that optimization path is dangerous and not very useful.
I'm not sure my understanding is right; please correct me if I'm wrong.
Yep, your understanding is correct!
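For reference, here is a minimal sketch of the kind of entry discussed above, assuming the same DSL shape as the existing _BUILD_TUPLE case (the stack effect shown is approximate):

```c
// _BUILD_LIST: we can't track the elements (the list is mutable), but the
// result is always exactly a list, so narrow the output type.
op(_BUILD_LIST, (values[oparg] -- list)) {
    list = sym_new_type(ctx, &PyList_Type);
}
```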
@Zheaoli is going to add _GET_LEN.
@tomasr8 is going to add _CALL_ISINSTANCE.
@diegorusso is going to add _CALL_LIST_APPEND.
@brandtbucher I'm trying to cover more uops in the optimizer, but I'm not sure about the priority of the remaining ones. Would you mind giving me some suggestions?
Sure! I think we can turn _CHECK_METHOD_VERSION into _CHECK_FUNCTION_VERSION_INLINE. Check out the optimizer case for _CHECK_FUNCTION_VERSION to see how it's done. This would work the same way, except you would check for &PyMethod_Type, and use the function from ((PyMethodObject *)callable)->im_func as the operand1.
Let me know if you have any questions.
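A very rough sketch of the shape this could take, modeled on the existing _CHECK_FUNCTION_VERSION handling; the stack-effect declaration, helper signatures, and operand handling below are assumptions rather than the actual implementation:

```c
op(_CHECK_METHOD_VERSION, (func_version/2, callable, null, unused[oparg] -- callable, null, unused[oparg])) {
    // If the callable is a known-constant bound method, check the version of
    // its underlying function directly and drop the method check.
    if (sym_matches_type(callable, &PyMethod_Type) && sym_is_const(callable)) {
        PyObject *method = sym_get_const(callable);
        PyObject *func = ((PyMethodObject *)method)->im_func;
        REPLACE_OP(this_instr, _CHECK_FUNCTION_VERSION_INLINE, 0, func_version);
        // Carry the inlined function as operand1, as described above.
        this_instr->operand1 = (uintptr_t)func;
    }
    sym_set_type(callable, &PyMethod_Type);
}
```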
> Sure! I think we can turn _CHECK_METHOD_VERSION into _CHECK_FUNCTION_VERSION_INLINE. Check out the optimizer case for _CHECK_FUNCTION_VERSION to see how it's done. This would work the same way, except you would check for &PyMethod_Type, and use the function from ((PyMethodObject *)callable)->im_func as the operand1. Let me know if you have any questions.
Thanks for the suggestion! I might need a little bit of time; I think I can draft a PR this week.
Reserving _ITER_NEXT_TUPLE for @noamcohen97.
@brandtbucher I would like to try some of the _LOAD_-prefixed uops. My current idea is that some of them could be made more powerful by setting some extra information, like constness?
@Zheaoli do you mind if I redirect you to _CALL_STR_1 here? https://github.com/python/cpython/issues/134584
How can I join this?
The following uops still have no entry in optimizer_bytecodes.c:
    _CHECK_PERIODIC _CHECK_PERIODIC_IF_NOT_YIELD_FROM _RESUME_CHECK _POP_TWO _END_FOR _POP_ITER _END_SEND _GUARD_TOS_SLICE _GUARD_NOS_OVERFLOWED _GUARD_TOS_OVERFLOWED _GUARD_BINARY_OP_EXTEND _BINARY_OP_EXTEND _STORE_SLICE _BINARY_OP_SUBSCR_LIST_SLICE _BINARY_OP_SUBSCR_DICT _BINARY_OP_SUBSCR_CHECK_FUNC _LIST_APPEND _SET_ADD _STORE_SUBSCR _DELETE_SUBSCR _CALL_INTRINSIC_1 _CALL_INTRINSIC_2 _GET_AITER _GET_ANEXT _GET_AWAITABLE _POP_EXCEPT _LOAD_COMMON_CONSTANT _LOAD_BUILD_CLASS _STORE_NAME _DELETE_NAME _UNPACK_SEQUENCE_LIST _STORE_ATTR _DELETE_ATTR _STORE_GLOBAL _DELETE_GLOBAL _LOAD_LOCALS _LOAD_NAME _LOAD_GLOBAL _DELETE_FAST _MAKE_CELL _DELETE_DEREF _LOAD_FROM_DICT_OR_DEREF _LOAD_DEREF _STORE_DEREF _COPY_FREE_VARS _BUILD_INTERPOLATION _BUILD_TEMPLATE _LIST_EXTEND _SET_UPDATE _SETUP_ANNOTATIONS _DICT_UPDATE _DICT_MERGE _MAP_ADD _LOAD_SUPER_ATTR_ATTR _LOAD_SUPER_ATTR_METHOD _GUARD_TYPE_VERSION_AND_LOCK _CHECK_MANAGED_OBJECT_HAS_VALUES _GUARD_DORV_NO_DICT _CHECK_EG_MATCH _CHECK_EXC_MATCH _IMPORT_NAME _IMPORT_FROM _IS_NONE _MATCH_CLASS _MATCH_MAPPING _MATCH_SEQUENCE _MATCH_KEYS _GET_YIELD_FROM_ITER _FOR_ITER_TIER_TWO _ITER_CHECK_LIST _GUARD_NOT_EXHAUSTED_LIST _ITER_NEXT_LIST_TIER_TWO _GUARD_NOT_EXHAUSTED_TUPLE _ITER_NEXT_TUPLE _ITER_CHECK_RANGE _GUARD_NOT_EXHAUSTED_RANGE _WITH_EXCEPT_START _PUSH_EXC_INFO _GUARD_DORV_VALUES_INST_ATTR_FROM_DICT _GUARD_KEYS_VERSION _CHECK_ATTR_METHOD_LAZY_DICT _CHECK_FUNCTION_VERSION_INLINE _EXPAND_METHOD _CHECK_IS_NOT_PY_CALLABLE _CALL_NON_PY_GENERAL _CHECK_RECURSION_REMAINING _EXIT_INIT_CHECK _CALL_BUILTIN_CLASS _CALL_BUILTIN_O _CALL_BUILTIN_FAST _CALL_BUILTIN_FAST_WITH_KEYWORDS _CALL_METHOD_DESCRIPTOR_O _CALL_METHOD_DESCRIPTOR_FAST_WITH_KEYWORDS _CALL_METHOD_DESCRIPTOR_NOARGS _CALL_METHOD_DESCRIPTOR_FAST _MAYBE_EXPAND_METHOD_KW _CHECK_FUNCTION_VERSION_KW _CHECK_METHOD_VERSION_KW _EXPAND_METHOD_KW _CHECK_IS_NOT_PY_CALLABLE_KW _CALL_KW_NON_PY _MAKE_CALLARGS_A_TUPLE _MAKE_FUNCTION _SET_FUNCTION_ATTRIBUTE _CONVERT_VALUE _FORMAT_SIMPLE _FORMAT_WITH_SPEC _SET_IP _SAVE_RETURN_OFFSET _CHECK_VALIDITY _POP_CALL _POP_CALL_ONE _POP_CALL_TWO _POP_TWO_LOAD_CONST_INLINE_BORROW _LOAD_CONST_UNDER_INLINE _LOAD_CONST_UNDER_INLINE_BORROW _START_EXECUTOR _MAKE_WARM _FATAL_ERROR _HANDLE_PENDING_AND_DEOPT _ERROR_POP_N _TIER2_RESUME_CHECK _GUARD_IP__PUSH_FRAME _GUARD_IP_YIELD_VALUE _GUARD_IP_RETURN_VALUE _GUARD_IP_RETURN_GENERATOR
Probably the most critical ones are those that deal with frames. E.g. _FOR_ITER_GEN_FRAME and _SEND_GEN_FRAME both just terminate the optimizer. Those are blocking us from optimizing generators.